When we are running a search, we want to find relevant results not only for the exact expression we typed on the search bar, but also for the other possible forms of the words we used. For example, it’s very likely we will want to see results containing the form “skirt” if we have typed “skirts” in the search bar.
This can be achieved through two possible methods: stemming and lemmatization. The aim of both processes is the same: reducing the inflectional forms of each word to a common base or root. However, the two methods are not exactly the same; this article will go over their differences along with some examples.
Main differences between stemming and lemmatization
The main difference lies in how they work and, consequently, in the result each of them returns.
- Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has clear limitations. Below we illustrate the method with examples in both English and Spanish.
- Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries that the algorithm can look through to link the form back to its lemma. Again, you can see how it works with the same example words.
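The contrast between the two approaches can be sketched in a few lines of Python. The suffix list and the lemma dictionary below are toy assumptions for illustration, far smaller than the resources a real stemmer or lemmatizer uses:

```python
# Illustrative sketch: a naive suffix-stripping "stemmer" versus a
# dictionary-lookup "lemmatizer". Both the suffix list and the lemma
# dictionary are toy assumptions, not real linguistic resources.

SUFFIXES = ["ies", "es", "s", "ing", "ed"]

LEMMAS = {"skirts": "skirt", "studies": "study", "studying": "study"}

def naive_stem(word: str) -> str:
    """Cut off the first matching suffix, like a crude stemming rule."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def lookup_lemma(word: str) -> str:
    """Link the form back to its lemma via the dictionary, if listed."""
    return LEMMAS.get(word, word)

for word in ["skirts", "studies", "studying"]:
    print(word, "->", naive_stem(word), "|", lookup_lemma(word))
```

Note how the blind cut works for "skirts" but turns "studies" into the non-word "studi", while the dictionary lookup recovers the lemma "study" in both cases.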
Another important difference to highlight is that a lemma is the base form of all its inflectional forms, whereas a stem isn’t. This is why regular dictionaries are lists of lemmas, not stems. This has two consequences:
- First, the stem can be the same for the inflectional forms of different lemmas. This translates into noise in our search results. In fact, it is very common to find a single form that is an instance of several lemmas; let's see some examples.
In Telugu (above), the form for “robe” is identical to the form for “I don’t share”, so their stems are indistinguishable too. But they, of course, belong to different lemmas. The same happens in Gujarati (below), where the forms and stems for “beat” and “set up” coincide, but we can separate one from the other by looking at their lemmas.
- Also, the same lemma can correspond to forms with different stems, and we need to treat them as the same word. For example, in Greek, a typical verb has different stems for perfective forms and for imperfective ones. If we were using a stemming algorithm, we would not be able to relate them to the same verb, but lemmatization makes it possible. We can clearly observe this in the example below:
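Both consequences can be reproduced with English examples as well. The word lists below are hand-made assumptions for illustration, not the output of a real stemmer or lemmatizer:

```python
# Toy illustration of the two consequences described above.

# 1) One stem, several lemmas: English "saw" is both the past tense of
#    the verb "see" and a noun naming a tool. A stemmer leaves a single
#    string, so a search over stems mixes results for both words (noise);
#    a lemmatizer keeps two separate entries.
stem_to_lemmas = {"saw": {"see", "saw"}}
print(stem_to_lemmas["saw"])  # two distinct lemmas behind one stem

# 2) Several stems, one lemma: like the Greek perfective/imperfective
#    stems, the forms "go" and "went" look unrelated, yet lemmatization
#    links both to the lemma "go", so a search can treat them as one word.
form_to_lemma = {"go": "go", "goes": "go", "went": "go", "gone": "go"}
print({form_to_lemma[f] for f in ("went", "goes")})  # {'go'}
```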
How do they work?
- Stemming: there are different algorithms that can be used in the stemming process, but the most common one for English is the Porter stemmer. The rules contained in this algorithm are divided into five phases, numbered from 1 to 5 and applied in sequence. The purpose of these rules is to reduce words to their roots.
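A simplified sketch of how the Porter stemmer's phased rules operate is shown below. Only the suffix rules of Step 1a are reproduced; the full algorithm runs five numbered phases, many of whose rules carry extra conditions on the shape of the remaining stem:

```python
# Sketch of one phase of the Porter stemmer. These four suffix rules
# correspond to Porter's Step 1a; within a phase, only the first rule
# whose suffix matches is applied.

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word: str) -> str:
    """Apply the first matching Step 1a rule to the word."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```

Note that "ponies" becomes "poni", a non-word: the stemmer only needs all inflections to collapse to the same string, not to produce a dictionary form.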
- Lemmatization: the key to this methodology is linguistics. To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This requires having dictionaries for every language that provide that kind of analysis.
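The dictionary lookup can be sketched as below: the key is not the surface form alone but the form together with its morphological analysis (reduced here to a part-of-speech tag). The entries are toy assumptions standing in for the detailed per-language resources the text describes:

```python
# Sketch of dictionary-based lemmatization keyed on morphological
# analysis. The entries are hand-made toy assumptions.

DICTIONARY = {
    ("better", "ADJ"): "good",     # comparative adjective of "good"
    ("better", "VERB"): "better",  # as in "to better oneself"
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",   # present participle of "meet"
}

def lemmatize(form: str, pos: str) -> str:
    """Link an inflected form back to its lemma via dictionary lookup."""
    return DICTIONARY.get((form, pos), form)

print(lemmatize("better", "ADJ"))    # good
print(lemmatize("meeting", "VERB"))  # meet
```

This is why the morphological analysis matters: the same surface form ("better", "meeting") yields a different lemma depending on how it is used in context.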
Which one is best: lemmatization or stemming?
In conclusion, developing a stemmer is far simpler than building a lemmatizer. The latter requires deep linguistic knowledge to create the dictionaries that allow the algorithm to look up the proper form of the word. Once this is done, the noise is reduced and the results of the information retrieval process are more accurate.
- Learn more about our Bitext Lexical Data Resources, with support for 100+ languages and dialects.
- Request a Demo: Our NLP framework offers a variety of services that can be combined to achieve the best results.
- Download our Benchmark: comparison among the NLTK stemmers and lemmatizer, the Stanford lemmatizer and the Bitext lemmatizer.
- Subscribe to our Newsletter to get the latest updates.
- Stay tuned! Follow Bitext on Twitter or LinkedIn.
Disclaimer: The examples used in this post have been created by our computational linguists: Clara García, Juan Pedro Cabanilles and Benjamín Ramirez.