Almost three years after Apple launched its well-known voice assistant Siri for the Arabic language, there is still room for improvement. Siri currently understands more than 20 languages and dialects, but when it comes to Arabic, it still cannot fully grasp what users need. Frequent recognition errors and poor comprehension are frustrating for Arabic speakers. What is going wrong?
Not long ago, Fuad Al-Attar, Executive Vice President in Industry Customer Services of Siemens UAE, wrote an article titled “Why Couldn’t Siri Speak Arabic Properly”. After reading it, the Bitext team wanted to go further by showing how Bitext technology could solve most of the problems Siri faces when dealing with Arabic linguistics. In his article, Al-Attar summarizes the most significant reasons why the Arabic language cannot be properly processed by a machine learning system:
What are the main challenges for Siri?
- Letter-shape features
The shape of a letter in Arabic varies depending on its position in a word. As Al-Attar points out, the letter ح looks like حـ at the beginning of a word, like ـحـ in the middle, and like ـح at the end. Although this is often considered a problem for virtual assistants, in fact it is not: computers perceive every letter as a code point, so the three shapes are understood as a single letter with different visual representations.
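This can be checked directly with the Python standard library: Unicode stores the base letter as a single code point, and even the legacy "presentation form" code points for the positional shapes normalize back to it. A minimal sketch:

```python
import unicodedata

# The letter HAH is stored as a single code point (U+062D); its
# initial/medial/final shapes are chosen by the text renderer, or
# exist only as compatibility "presentation form" code points.
hah = "\u062D"        # ح  base letter
isolated = "\uFEA1"   # ﺡ  isolated presentation form
final = "\uFEA2"      # ﺢ  final presentation form
initial = "\uFEA3"    # ﺣ  initial presentation form
medial = "\uFEA4"     # ﺤ  medial presentation form

# NFKC normalization maps every positional form back to the base letter,
# so a parser sees one and the same character in all four positions.
for form in (isolated, final, initial, medial):
    assert unicodedata.normalize("NFKC", form) == hah

print(unicodedata.name(hah))  # ARABIC LETTER HAH
```

In practice, text produced by a keyboard already uses the base code points; the presentation forms only appear in legacy data, and normalization removes even that source of variation.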
- Variation in syntactic structures
There are several varieties of Arabic: Classical, Modern Standard, and the dialects. Sentence syntax and word order differ considerably across these three varieties, which makes them a challenge for a general parsing technique. To solve this problem, Siri must treat every variety as an independent language with its own syntactic and lexical features, the same way it distinguishes the Chinese spoken in mainland China from the Chinese spoken in Taiwan.
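As a rough illustration (all names below are hypothetical, not Siri's or Bitext's actual architecture), treating each variety as an independent language amounts to routing a query to a variety-specific parser with its own lexicon and grammar:

```python
# Hypothetical sketch: route a query to a variety-specific parser.
# "ar" = Modern Standard Arabic, "arz" = Egyptian Arabic (ISO 639 codes);
# the parser bodies are placeholders for real variety-specific analyzers.

def parse_msa(text):
    return ("ar", text.split())

def parse_egyptian(text):
    return ("arz", text.split())

PARSERS = {"ar": parse_msa, "arz": parse_egyptian}

def parse(text, variety):
    # Fall back to Modern Standard Arabic when no dedicated parser exists.
    return PARSERS.get(variety, parse_msa)(text)

print(parse("مرحبا بالعالم", "arz")[0])  # arz
```

The design point is that the fallback is explicit: a dialect query degrades gracefully to the MSA parser instead of failing.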
- Multiple words, one lemma
Prefixes, vowels, or suffixes may be added to words, changing their meaning and part of speech. This concern is easily solved with a lexicon covering all word inflections together with their part of speech and grammatical features. The same happens in many other languages: even in English, quick (adjective) and quickly (adverb) differ in meaning and POS only by a suffix.
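A full-form lexicon of this kind can be sketched as a lookup table mapping every inflected surface form to its lemma, part of speech, and grammatical features (the entries below are toy examples, not Bitext's actual lexicon):

```python
# Toy full-form lexicon: surface form -> lemma, POS, features.
LEXICON = {
    "quick":   {"lemma": "quick", "pos": "ADJ",  "feats": {}},
    "quickly": {"lemma": "quick", "pos": "ADV",  "feats": {}},
    # Arabic: كتب "he wrote" vs كتبوا "they wrote" share one lemma.
    "كتب":   {"lemma": "كتب", "pos": "VERB", "feats": {"person": 3, "number": "sg"}},
    "كتبوا": {"lemma": "كتب", "pos": "VERB", "feats": {"person": 3, "number": "pl"}},
}

def analyze(word):
    # Return the full analysis, or None for out-of-vocabulary words.
    return LEXICON.get(word)

print(analyze("quickly")["pos"])  # ADV
```

Two inflections of one lemma get separate entries, so the ambiguity introduced by affixation is resolved at lookup time rather than by guessing.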
- Lack of a named-entity corpus
Despite many attempts, results are not as good as expected. Since Arabic texts do not use capital letters, an automated system cannot rely on capitalization to recognize named entities. Therefore, the most suitable way to obtain such a corpus is through word lists and manual curation.
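The word-list approach can be sketched as a small gazetteer lookup: instead of capitalization cues, each token is checked against a manually curated entity list (the entries below are illustrative):

```python
# Toy gazetteer built from a manually curated word list,
# since capitalization cues are unavailable in Arabic script.
GAZETTEER = {
    "القاهرة": "LOCATION",  # Cairo
    "مصر":    "LOCATION",  # Egypt
    "محمد":   "PERSON",    # Muhammad
}

def tag_entities(tokens):
    # Tag each token with its entity type, or "O" (outside) if unlisted.
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

print(tag_entities(["زرت", "القاهرة"]))  # "I visited Cairo"
```

Real systems combine such gazetteers with contextual rules, since a bare list cannot disambiguate names that are also common words.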
- Emphatic and diacritic marks
The Shadda and Hamza are usually omitted in written text, which hinders correct parsing. The same occurs with subscript and superscript diacritic symbols. Nevertheless, every marked word can be registered as a word in its own right, not necessarily related to the unmarked form.
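Because the harakat (fatha, damma, kasra, shadda, sukun, and so on) are combining characters in Unicode, relating a vocalized word to its unvocalized spelling for lookup is straightforward. A minimal sketch:

```python
import unicodedata

def strip_diacritics(word):
    """Remove Arabic vowel and emphasis marks, which are Unicode
    combining characters, leaving only the base letters."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))

vocalized = "مُحَمَّد"  # fully vocalized spelling of "Muhammad"
print(strip_diacritics(vocalized))  # محمد
```

A lexicon can then store either form as a key: the vocalized word as an entry of its own where the marks are distinctive, or the stripped form where they are not.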
- Transliteration of named-entities
Proper names tend to appear in Arabic text under many transliterated forms, which makes it more difficult for a machine to recognize them. Once again, every form of a name can be classified under the same lemma so that the machine recognizes all of them as the same entity.
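Classifying the variants under one lemma can be sketched as a normalization table mapping every observed transliteration to a canonical form (the variant list below is a toy example):

```python
# Toy variant table: several Arabic transliterations of "England"
# all map to one canonical lemma.
VARIANTS = {
    "إنجلترا": "England",
    "انجلترا": "England",  # spelled without the initial hamza
    "إنكلترا": "England",  # alternative transliteration with kaf
}

def canonical(name):
    # Unknown names pass through unchanged.
    return VARIANTS.get(name, name)

print(canonical("انجلترا"))  # England
```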
- Right-to-left script writing without caps
These well-known features of the Arabic language make it harder for standard algorithms to recognize proper names or sentence beginnings. It is true that, without periods separating sentences, detecting where each sentence begins and ends is very difficult. The right-to-left script, however, poses no problem for a machine: the input is internally processed from the first word to the last, and the visual representation has no effect on data processing.
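This is easy to demonstrate: Unicode text is stored in logical order, so the first code point of a string is the first letter written, regardless of which direction the glyphs are drawn on screen.

```python
s = "مرحبا بالعالم"  # "hello world" in Arabic, displayed right-to-left

# s[0] is the first letter written (م), even though it appears
# as the rightmost glyph when the string is rendered.
print(s[0])

# Tokenization likewise yields words in logical (spoken) order.
tokens = s.split()
print(tokens[0])  # مرحبا ("hello")
```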
- Subject ellipsis
A phenomenon that rarely happens in English is quite common in Arabic, Spanish, and many other languages. In those languages, the omission of subject pronouns does not prevent a conversational agent from understanding a query.
- Ambiguous words and sentence structure
Words can be segmented in multiple ways, leaving room for multiple meanings. The same occurs at the syntactic level when a sentence admits different structures. This problem can be solved by a linguistic or context-specific analysis carried out by the machine itself.
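The segmentation ambiguity can be made concrete by enumerating every way a letter sequence splits into lexicon words. A minimal sketch with a toy lexicon: the letters of فيلم ("film") also split into في + لم ("in" + "not"), among others.

```python
# Toy lexicon: في "in", فيل "elephant", لم "not", plus single letters
# standing in for other short entries.
LEXICON = {"في", "فيل", "لم", "ل", "م"}

def segmentations(s, lexicon):
    """Enumerate all ways to split s into words from the lexicon."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            for rest in segmentations(s[i:], lexicon):
                results.append([prefix] + rest)
    return results

# فيلم yields three competing segmentations under this lexicon.
print(len(segmentations("فيلم", LEXICON)))  # 3
```

A disambiguation step, whether rule-based or statistical, then has to pick the reading that fits the surrounding context.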
- Prepositional additions
Some prepositions can be attached to a sentence, changing the meaning and interpretation of the message. Arabic is not the only language with this issue; the same happens in German, for instance. A characteristic feature of German is its ability to create verbs with new meanings by adding prefixes to nouns, adjectives, or other verbs. Just as this issue was addressed for German, it can be tackled at the syntactic and lexical level for Arabic.
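At the lexical level, this amounts to splitting off the attached prepositions and conjunctions (clitics such as wa- "and", bi- "with", li- "to") when the remainder is a known word. A hedged sketch with a toy vocabulary:

```python
VOCAB = {"كتاب", "بيت"}    # toy vocabulary: "book", "house"
CLITICS = ("و", "ب", "ل")  # wa- "and", bi- "with", li- "to"

def split_clitic(token):
    """Return (clitic, stem); clitic is None when nothing is split off."""
    if token in VOCAB:
        return (None, token)  # already a known word: do not split
    if token and token[0] in CLITICS and token[1:] in VOCAB:
        return (token[0], token[1:])
    return (None, token)

print(split_clitic("بكتاب"))  # ('ب', 'كتاب') — "with a book"
```

Checking the vocabulary before splitting matters: words that merely begin with a clitic letter must not be torn apart.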
At this point, rather than relying only on machine learning for a bot's understanding skills, implementing solutions based on NLP techniques, such as those offered by Bitext, is far more helpful. Even a huge corpus covering all variations, paired with statistical models, may not be an effective alternative, whereas a proper lexicon and syntax give developers control over the system. The most suitable solution is therefore a hybrid system combining ML and linguistic techniques. Bitext technologies, based on a linguistic approach, help virtual assistants understand more accurately what humans say while giving more appropriate answers to their queries.
Do you want to learn more? Take a quick look at Bitext multilingual solutions by clicking here.