As the amount of available data grows every day, information retrieval systems have become a must-have tool. We are used to seeing them in search engines, but they are becoming necessary for other types of businesses as well.
For processing search queries, the traditional approach has been stemming. Because of its limitations, however, it is worth looking at another method, and that is where lemmatization comes in.
The aim of both processes is the same: reducing the inflectional forms and derivations of each word to a common base or root.
When we run a search, we want to find as many relevant results as possible, and that includes not only the exact word we typed in the search bar but also the words that share its root. For example, when we look for the word sewer, our findings are enriched if the results also contain words like sew or sewerlike.
- Main differences between both processes:
-Stemming algorithms work by cutting off the end of the word, and in some cases also the beginning, while looking for the root. This indiscriminate cutting succeeds on some occasions but not always, which is why we say that this approach has limitations.
-Lemmatization, on the other hand, relies on the morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can consult to link each form back to its lemma.
The main difference is that a lemma is the base form of all its inflectional forms, whereas a stem is not. The same stem can be shared by the inflectional forms of different lemmas, which adds noise to our search results.
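To make the noise problem concrete, here is a minimal sketch. The `toy_stem` function is a made-up suffix stripper (not the Porter algorithm or any real library), and `LEMMAS` is a tiny hand-written dictionary assumed only for this illustration. It shows three forms collapsing to one stem even though they belong to two different lemmas.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The suffix list is an assumption chosen for this example.
SUFFIXES = ("ness", "ings", "ing", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-made dictionary that maps inflected forms to their lemma.
LEMMAS = {
    "meaning": "meaning",
    "meanings": "meaning",
    "meanness": "meanness",
}


def toy_lemmatize(word: str) -> str:
    """Look the form up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["meaning", "meanings", "meanness"]
stems = {w: toy_stem(w) for w in words}
lemmas = {w: toy_lemmatize(w) for w in words}

for w in words:
    print(f"{w:10} stem={stems[w]:6} lemma={lemmas[w]}")
```

With the stemmer, a search for meanness would also match documents about meanings, because every form is cut down to the same stem. The dictionary lookup keeps the two lemmas apart.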
Also, the same lemma can have forms with different stems. For example, in Greek a typical verb has different stems for its perfective and imperfective forms. With a stemming algorithm we would not be able to relate them to the same word, but with lemmatization we can. We can clearly observe it in the example below:
- How do they work?
-Stemming: different algorithms can be used in the stemming process, but the most common one for English is the Porter stemmer. The rules of this algorithm are divided into five phases, numbered from 1 to 5. The purpose of these rules is to reduce each word to its root.
-Lemmatization: the key to this methodology is linguistics. To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This in turn requires dictionaries for every language to provide that analysis.
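The two mechanisms described above can be sketched side by side. The rules below are only step 1a of the Porter stemmer (the first sub-step of phase 1), not the full five-phase algorithm. The `LEMMA_DICTIONARY` is a tiny hypothetical stand-in for the detailed dictionaries a real lemmatizer would need.

```python
# Step 1a of the Porter stemmer as (suffix, replacement) pairs.
# The rules are ordered from longest to shortest suffix, so the first
# match is also the longest match, and only that rule is applied.
STEP_1A_RULES = (
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
)


def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# Hypothetical mini-dictionary: a real lemmatizer needs far more entries
# and morphological information than this.
LEMMA_DICTIONARY = {
    "goes": "go",
    "going": "go",
    "went": "go",
    "gone": "go",
}


def lemmatize(word: str) -> str:
    """Return the lemma if the form is in the dictionary, else the word."""
    return LEMMA_DICTIONARY.get(word, word)


for form in ("caresses", "ponies", "cats", "goes", "went"):
    print(form, porter_step_1a(form), lemmatize(form))
```

The rule-based step treats goes and went as unrelated strings, while the dictionary links both to the lemma go. That is the kind of relation between forms with different stems that suffix rules alone cannot capture.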
Below you can see some examples in languages known for having a more complex inflection system than English, which show that the lemmatization approach provides better results.
In conclusion, developing a stemmer is not especially difficult, but creating a lemmatizer is. It requires deep knowledge of linguistics to build the dictionaries that allow the algorithm to look up the proper form of each word. Once this is done, the results of the information retrieval process will be more accurate and contain less noise.
If you want to see more applications of lemmatization, such as textual bases or e-commerce search, you can find more information here.
Disclaimer: The examples used in this post have been created by our computational linguists: Clara García, Juan Pedro Cabanilles and Benjamín Ramirez.