Usually in this blog we write about text analysis products such as lemmatizers or parsers and how they can help solve issues in products that need an accurate understanding of text to function. Today, however, we also want to show you what is behind our technology and how we create it. That is why we decided to interview one of our expert linguists, Clara Garcia, for some insights.
First of all, what have you been working on lately?
One of our team's latest projects involved creating a morphological analyzer for Tagalog.
Can you explain what a morphological analyzer is?
In inflected languages, words are formed through morphological processes such as affixation. For example, by adding the suffix ‘-s’ to the verb ‘to dance’, we form the third person singular ‘dances’.
A morphological analyzer assigns the attributes of a given word by evaluating what morphological processes the form has undergone. If you give it the word ‘bailaré’ in Spanish, it will tell you it is the first person, singular, simple future, indicative form of the verb ‘bailar’.
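To make the idea concrete, here is a minimal sketch in Python of what such an output could look like for the ‘bailaré’ example. It is only an illustration, not the analyzer discussed in this interview: the `Analysis` class, its field names, and the function `analyze_spanish_future_1sg` are hypothetical, and the rule covers only regular verbs. A real analyzer would also check the recovered infinitive against a lexicon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Analysis:
    """Attributes assigned to one word form (field names are illustrative)."""
    lemma: str
    person: int
    number: str
    tense: str
    mood: str


def analyze_spanish_future_1sg(form: str) -> Optional[Analysis]:
    """Toy rule: infinitive + 'é' marks first person singular simple future.

    Assumption: only regular verbs are handled and no lexicon is consulted.
    """
    if not form.endswith("é"):
        return None
    infinitive = form[:-1]  # strip the person/tense ending
    if infinitive[-2:] not in ("ar", "er", "ir"):
        return None  # what remains does not look like an infinitive
    return Analysis(infinitive, 1, "singular", "simple future", "indicative")


result = analyze_spanish_future_1sg("bailaré")
print(result.lemma, result.person, result.number, result.tense, result.mood)
```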
Building a tool like this involves analyzing the grammar of the language, creating morphological models of how each part of speech (POS) inflects, and then writing software that adapts those models to automatically detect which attributes should be assigned to a particular form. In this case I will talk specifically about Tagalog, since we just developed a morphological analyzer for it.
What’s the difference between creating this analyzer for frequently used languages like English and for lesser-known ones?
The first difference between common and more “exotic” languages is the amount of literature and resources you can get. Finding enough literature to create all morphological models for an “exotic” language can be challenging.
Apart from that, the creation of the tool depends on the language's inflectional system and whether it is very complex or relatively simple. It is doable in both cases, but a more complex inflectional system will also require more complex software. Both the English and Tagalog inflectional systems are manageable enough to create a fine-grained analyzer.
Can you tell us more about Tagalog and its particularities, so we can better understand the creation process?
Tagalog is mainly spoken in the Philippines and belongs to the Austronesian family. As I said, its inflectional system is manageable: only verbs and pronouns inflect. I will focus on verbs, which are a bit more complex. They inflect to mark aspect, focus/voice and mood, and they do so through affixation and reduplication.
Tagalog is different from other languages in that it uses reduplication to mark aspect (most languages that use reduplication do so to mark intensity, to form plurals, or for onomatopoeia, among other uses).
It also has a rich affixation system, with suffixes, infixes, prefixes and circumfixes that mark focus and mood. It is therefore an interesting language for this kind of morphological tool.
I will show a couple of examples of the steps to follow in Tagalog:
For the contemplated form of the verb ‘to eat’ (seen in the table above), we follow these steps:
- Find the stem -> kain
- Find the first consonant (C) and vowel (V) of the stem -> ‘k’, ‘a’
- See if the form matches the structure: C V + stem -> k-a-kain
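The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the script mentioned in this post: the function names are hypothetical, and it assumes the stem begins with a consonant followed by a vowel, as ‘kain’ does.

```python
VOWELS = set("aeiou")


def reduplicate_cv(stem: str) -> str:
    """Prefix the stem with a copy of its first consonant and vowel."""
    # Assumption: the stem starts with consonant + vowel (e.g. 'kain').
    if len(stem) < 2 or stem[0] in VOWELS or stem[1] not in VOWELS:
        raise ValueError(f"stem {stem!r} does not start with consonant + vowel")
    consonant, vowel = stem[0], stem[1]
    return consonant + vowel + stem


def is_contemplated(form: str, stem: str) -> bool:
    """Check whether the form matches the structure C V + stem."""
    return form == reduplicate_cv(stem)


print(reduplicate_cv("kain"))             # kakain
print(is_contemplated("kakain", "kain"))  # True
```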
For the progressive form of the verb ‘to read’, we follow these steps:
- Find the stem -> basa
- Find the first consonant (C) and vowel (V) of the stem -> ‘b’, ‘a’
- Find the affix -> ‘mag’
- Since the prefix is ‘mag-’ and the aspect is progressive, ‘mag-’ changes into ‘nag-’ -> mag > nag
- See if the form matches the structure: Prefix + C V + stem -> nag-b-a-basa
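The same steps can be written as a short Python sketch. It is illustrative only: the function names and the returned dictionary keys are hypothetical, only the ‘mag-’ prefix is handled, and the stem is assumed to start with a consonant followed by a vowel.

```python
from typing import Optional

VOWELS = set("aeiou")


def progressive_with_mag(stem: str) -> str:
    """Build the progressive form of a 'mag-' verb: nag + C V + stem."""
    # Assumption: the stem starts with consonant + vowel (e.g. 'basa').
    if len(stem) < 2 or stem[0] in VOWELS or stem[1] not in VOWELS:
        raise ValueError(f"stem {stem!r} does not start with consonant + vowel")
    prefix = "mag"
    progressive_prefix = "n" + prefix[1:]  # mag > nag in the progressive
    return progressive_prefix + stem[:2] + stem


def analyze_progressive(form: str, stem: str) -> Optional[dict]:
    """Return the attributes if the form matches Prefix + C V + stem."""
    if form != progressive_with_mag(stem):
        return None
    return {"stem": stem, "affix": "mag-", "aspect": "progressive"}


print(progressive_with_mag("basa"))              # nagbabasa
print(analyze_progressive("nagbabasa", "basa"))
```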
Can you provide examples of some difficulties you faced during the process?
As with most languages, the hardest part of morphological analysis is covering all the possible phonological processes in the language. It is challenging to account for all of them, particularly the least productive ones, which we commonly call exceptions.
One example of these phonological processes in Tagalog is found in roots that have ‘o’ as the vowel of the final syllable. They change ‘o’ into ‘u’ when a suffix is added: ‘suntok’ > ‘suntukin’. Accounting for all of these, when there can be dozens of such processes, is one of the main difficulties I found.
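A minimal Python sketch of this single rule, in both directions, might look as follows. The function names are hypothetical and only the ‘o’ > ‘u’ change is modeled; the other phonological processes of the language are deliberately left out.

```python
from typing import List, Optional

VOWELS = set("aeiou")


def last_vowel_index(word: str) -> Optional[int]:
    """Return the position of the last vowel in the word, if any."""
    for i in range(len(word) - 1, -1, -1):
        if word[i] in VOWELS:
            return i
    return None


def add_suffix(root: str, suffix: str) -> str:
    """Attach a suffix, changing a final-syllable 'o' into 'u'."""
    i = last_vowel_index(root)
    if i is not None and root[i] == "o":
        root = root[:i] + "u" + root[i + 1:]
    return root + suffix


def candidate_roots(form: str, suffix: str) -> List[str]:
    """Undo the rule: a final-syllable 'u' may come from an 'o' in the root."""
    if not form.endswith(suffix):
        return []
    base = form[: -len(suffix)]
    candidates = [base]
    i = last_vowel_index(base)
    if i is not None and base[i] == "u":
        candidates.append(base[:i] + "o" + base[i + 1:])
    return candidates


print(add_suffix("suntok", "in"))         # suntukin
print(candidate_roots("suntukin", "in"))  # ['suntuk', 'suntok']
```

When analyzing, the candidates would then be checked against a list of known roots, which is why the function returns every possibility instead of guessing one.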
Lastly, the most important question: what are the applications of a tool like this?
This tool can be helpful for many tasks, some but not all related to NLP.
In the ‘assignment of attributes’ process the word is lemmatized and stemmed. Knowing the reduplication and/or affixes that apply to a word, we can find its lemma. This is useful for many NLP processes, for example concordances or POS tagging.
It can also be applied to search engines, so that when you look up an inflected verb or noun, the engine finds its lemma and suggests everything related to it. For example, if you look up something about ‘walking’, you will get more results if the search engine knows that the base form of the verb is ‘walk’ and can access the whole verb paradigm from there.
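As a rough sketch of the idea in Python, documents can be indexed by lemma instead of by surface form. Everything here is illustrative: the tiny `LEMMAS` table stands in for the output of a real morphological analyzer, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical lemma table; a real system would ask the analyzer instead.
LEMMAS = {"walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk"}


def lemmatize(word: str) -> str:
    """Map a word form to its lemma, falling back to the lowercased form."""
    return LEMMAS.get(word.lower(), word.lower())


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index each document under the lemma of every word it contains."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[lemmatize(word)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Look up the lemma of the query, so every inflected form matches."""
    return sorted(index.get(lemmatize(query), set()))


docs = {1: "she walked home", 2: "he walks daily", 3: "they drove"}
index = build_index(docs)
print(search(index, "walking"))  # [1, 2]
```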
This tool can also help us with indexation of databases and therefore with information retrieval. We don’t just have words but also their attributes, lemma, and stem. We can easily access specific elements through all this information.
A morphological analyzer can also be used as part of a machine translation system, reducing the complexity of the input and helping to understand the syntax. The words become a bag full of information pieces (lemma + tense + aspect + person, etc.).
If you want to replicate our morphological analyzer, download our presentation with the Python script and some examples: