Whenever you search for something on the internet, say "lemmatizing words", you’d probably get better results if the search also covered the different inflectional forms of each term (lemmatize, lemmatizes, lemmatized, word, etc.). Well, that’s where lemmatization comes in.
Lemmatization is a linguistic processing task that makes other systems work better by grouping together the morphologically related forms of a word, making document retrieval (among many other tasks) much more efficient.
When we lemmatize a word, we obtain its base form, the one that would typically appear in a dictionary. If it’s a verb, the lemma is the infinitive. If it’s a noun, it’s the singular form.
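To make the retrieval idea a bit more concrete, here is a minimal sketch in Python. It assumes a tiny hand-written lexicon (`LEMMAS`) and made-up sample documents; the function names are ours for illustration and have nothing to do with any particular product. Documents are indexed by lemma, so a query matches regardless of which inflected form the document happens to use.

```python
from collections import defaultdict

# Hypothetical toy lexicon (an assumption for this sketch): inflected form -> lemma.
LEMMAS = {
    "lemmatizes": "lemmatize",
    "lemmatized": "lemmatize",
    "lemmatizing": "lemmatize",
    "words": "word",
}


def lemmatize(token):
    """Return the dictionary form of a token, or the lowercased token if unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_index(docs):
    """Map each lemma to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token.strip(".,!?"))].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every lemma in the query."""
    lemma_sets = [index.get(lemmatize(t), set()) for t in query.split()]
    return set.intersection(*lemma_sets) if lemma_sets else set()


# Made-up sample documents.
DOCS = {
    "a": "She lemmatized every word.",
    "b": "Lemmatizing words is useful.",
    "c": "Stemming is a different technique.",
}
INDEX = build_index(DOCS)
```

With this index, `search(INDEX, "lemmatize word")` finds both document "a" and document "b", even though neither contains the exact strings "lemmatize" and "word" together. A real lemmatizer would of course replace the toy lexicon with full morphological knowledge of the language.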
Now that’s all fine and dandy for English, which barely has any inflection compared to other languages. However, there are more synthetic languages such as Spanish, where the verb "comer" has many different inflected forms depending on person, number and mood (comes, coméis, comáis, etc.), or Arabic, where verb inflection also changes with gender. For example, the verb "to eat", أكل (‘akal), has two different forms for the third person singular: تأكل (ta’kol) if the subject is feminine and يأكل (ya’kol) if it’s masculine. Furthermore, in some languages such as Italian, even some prepositions take different forms when they combine with the article (sullo, sulla, sul). As you can see, lemmatizing in these languages is a much more complex task. As an added difficulty, in most, if not all, languages there are morphological irregularities among the forms of a word. For example, in Spanish, some inflected forms of verbs such as "ir" don’t resemble their infinitive at all (voy, fuera, íbamos).
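One common way to cope with both regular and irregular forms is to check an exception list first and fall back to suffix rules otherwise. The Python sketch below is only an illustration under that assumption: the exception table holds the three forms of "ir" mentioned above, and the suffix rules are hypothetical ones that cover just the three forms of "comer" from the example. It is far from a complete description of Spanish.

```python
import unicodedata

# Irregular forms have to be listed explicitly, since no suffix rule can recover "ir".
# Note: "fuera" is ambiguous in real Spanish (it is also a form of "ser"),
# so a real lemmatizer needs context; this toy table simply picks "ir".
IRREGULAR = {"voy": "ir", "fuera": "ir", "íbamos": "ir"}

# Hypothetical suffix rules for a few regular -er verb endings: (ending, replacement).
# Longer endings come first so that they are tried before shorter ones.
ER_SUFFIX_RULES = [("éis", "er"), ("áis", "er"), ("es", "er")]


def lemmatize_es(form):
    """Toy Spanish verb lemmatizer: exception lookup first, then suffix rules."""
    # Normalize so accented letters compare equal however they were typed.
    form = unicodedata.normalize("NFC", form.lower())
    if form in IRREGULAR:
        return IRREGULAR[form]
    for ending, replacement in ER_SUFFIX_RULES:
        if form.endswith(ending):
            return form[: -len(ending)] + replacement
    return form  # unknown form: return it unchanged
```

Here `lemmatize_es("coméis")` and `lemmatize_es("comáis")` both return "comer", while `lemmatize_es("voy")` returns "ir" through the exception table. Rules this naive would mangle plenty of other words (any noun ending in "es", for instance), which is precisely why lemmatizing highly inflected languages takes much more than a handful of suffix rules.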
Of course, lemmatization isn’t only useful for document retrieval. Let’s suppose you want to train a chatbot, a domestic one to help you around the house. It’s important for the chatbot to know that in some cases the plural and singular forms refer to the same thing, as in "turn on the light!" and "turn on the lights!". An important aspect to take into account is that many artificial intelligence products such as chatbots require training data to work well, and the more data they have, the better they perform. But as we all know, training takes time, and time is always of the essence; incorporating a lemmatizer into the mix could prove to be an effective way to elegantly cut corners and save time, especially when working with more synthetic languages. Nevertheless, before lemmatizing, we should probably take a minute to consider how fine-grained we want our training data to be.
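As a rough illustration of the chatbot case, the Python sketch below collapses utterances that differ only in number into a single training example. The noun table and the sample utterances are made up for this example, and `normalize` is a hypothetical helper, not part of any chatbot framework.

```python
# Hypothetical plural -> singular table for the household domain (an assumption).
NOUN_LEMMAS = {"lights": "light", "lamps": "lamp"}


def normalize(utterance):
    """Lowercase, drop surrounding punctuation and replace known plurals by their lemma."""
    tokens = utterance.lower().strip("!?. ").split()
    return " ".join(NOUN_LEMMAS.get(token, token) for token in tokens)


# Made-up training utterances.
TRAINING = ["Turn on the light!", "Turn on the lights!", "turn on the lamps"]

# Deduplicate after normalization: three utterances become two distinct examples.
UNIQUE_EXAMPLES = sorted({normalize(u) for u in TRAINING})
```

After normalization, "Turn on the light!" and "Turn on the lights!" become the same string, so the chatbot needs fewer distinct examples to learn the command. The flip side is the granularity question raised above: once the forms are merged, the chatbot can no longer tell whether you meant one light or all of them.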
It’s obvious that nowadays we’re always exchanging and looking up information. Since there’s so much of it, and in so many different languages, we need to keep up and be smart about how we do our text mining and document retrieval, and an essential part of that is using lemmas. This is why here at Bitext we’ve developed state-of-the-art lemmatizers for over 15 languages, from barely inflected languages like English to highly inflected ones like Spanish or even Arabic.
Feel free to try our demo with sample phrases below! You can even use this article and see how our lemmatizer recognizes all the different forms of "lemmatize".