Stemming and lemmatization are methods used by search engines and chatbots to analyze the meaning behind a word. Stemming reduces a word to its stem by trimming affixes, while lemmatization maps it to its dictionary form (the lemma) using morphological analysis and the context in which the word is used. We'll go into more detailed explanations and examples later.
When running a search, we want to find relevant results not only for the exact expression we typed in the search bar, but also for the other possible forms of the words we used.
For example, it’s very likely we will want to see results containing the form “skirt” if we have typed “skirts” in the search bar. Lemmatization and stemming are applied in this case.
In the case of a chatbot, lemmatization is one of the most effective ways to help it better understand customers' queries. Because this method carries out a morphological analysis of the words, the chatbot can identify the form each word takes in context and, therefore, better understand the overall meaning of the entire sentence.
Main differences between stemming and lemmatization
The aim of both processes is the same: reducing the inflectional forms of each word into a common base or root. However, these two methods are not exactly the same. The main difference is the way they work and therefore the result each of them returns.
- Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting succeeds on some occasions but not always, which is why this approach has limitations. Below we illustrate the method with examples in both English and Spanish.
- Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma. Again, you can see how it works with the same example words.
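The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration only: the suffix list, the mini-dictionary, and the function names `toy_stem` and `toy_lemmatize` are assumptions made up for this example, and real stemmers and lemmatizers rely on far larger rule sets and dictionaries.

```python
# Toy illustration only; not a production stemmer or lemmatizer.

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Cut off the first matching suffix, with no knowledge of the word itself."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary linking inflected forms back to their lemma.
LEMMA_DICT = {
    "skirts": "skirt",
    "studies": "study",
    "studying": "study",
    "went": "go",
}


def toy_lemmatize(word: str) -> str:
    """Look the form up in the dictionary; fall back to the form itself."""
    return LEMMA_DICT.get(word, word)


for form in ("skirts", "studies", "studying", "went"):
    print(f"{form:10} stem={toy_stem(form):8} lemma={toy_lemmatize(form)}")
```

In this sketch, "studies" and "studying" end up with different stems ("studi" and "study") but share the lemma "study", and the irregular form "went" is only linked to "go" by the dictionary lookup.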
Another important difference to highlight is that a lemma is the base form of all its inflectional forms, whereas a stem isn’t. This is why regular dictionaries are lists of lemmas, not stems. This has two consequences:
- First, the stem can be the same for the inflectional forms of different lemmas. This translates into noise in our search results. In fact, it is very common to find entire forms as instances of several lemmas; let’s see some examples.
In Telugu (above), the form for “robe” is identical to the form for “I don’t share”, so their stems are indistinguishable too. But they, of course, belong to different lemmas. The same happens in Gujarati (below), where the forms and stems for “beat” and “set up” coincide, but we can separate one from another by looking at their lemmas.
- Also, the same lemma can correspond to forms with different stems, and we need to treat them as the same word. For example, in Greek, a typical verb has different stems for perfective forms and for imperfective ones. With a stemming algorithm we would not be able to relate them to the same verb, but with lemmatization we can. We can clearly observe it in the example below:
How do they work?
- Stemming: there are different algorithms that can be used in the stemming process, but the most common in English is the Porter stemmer. The rules contained in this algorithm are divided into five different phases, numbered from 1 to 5. The purpose of these rules is to reduce the words to the root.
- Lemmatization: the key to this methodology is linguistics. To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This requires having dictionaries for every language to provide that kind of analysis.
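To give a feel for what the stemming rules look like, here is a minimal sketch of a single sub-step of the first phase, step 1a as described in Porter's original definition of the algorithm, which deals with plural endings. The function name `porter_step_1a` is chosen for this example; the remaining steps and the conditions they place on the stem are omitted, so this is not a complete Porter stemmer.

```python
# Step 1a of the original Porter algorithm only; all later steps are omitted.
# Rules are ordered so that the longest matching suffix is tried first.
STEP_1A_RULES = (
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (left unchanged)
    ("s", ""),       # cats     -> cat
)


def porter_step_1a(word: str) -> str:
    """Apply the first matching step 1a rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for form in ("caresses", "ponies", "caress", "cats"):
    print(form, "->", porter_step_1a(form))
```

Note that "ponies" becomes "poni", which is not a dictionary word: the rules aim at a common root, not at a lemma. For real use, a full implementation such as the `PorterStemmer` class shipped with NLTK is a better starting point than a hand-written rule table.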
How to increase recall beyond lemmatization?
Lemmatization is a common technique to increase recall (to make sure no relevant document gets lost). However, lemmatization may not be enough in many cases and we may need to further increase recall with other techniques.
For example, if you search for information on “John Kennedy”, documents that contain any of the following will definitely be relevant: “JFK”, “John F Kennedy”, “John Fitzgerald Kennedy”.
The same goes for all variations with or without spaces or periods: “John F. Kennedy”…
Another similar example is “cost of labor”, where you also want to retrieve “cost of labour”.
The same thing happens with “bull market” and “bullish market” or “up market”.
These types of semantic equivalents are popularly known as “synonyms” (although in linguistic terms some are not synonyms but acronyms or regional US/UK variations; our point is to stress that there are many types of variations that we need to consider for increasing recall and query expansion).
Making sure that your search engine knows about these language nuances will improve results and make the user experience much more positive.
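One simple way to approach this kind of query expansion is a table of equivalence groups combined with light normalization. The sketch below is only an illustration under stated assumptions: the groups are hard-coded from the examples in this section, the function names `normalize` and `expand_query` are made up for this example, and a real system would load its equivalents from a curated linguistic resource.

```python
import re

# Hypothetical equivalence groups, taken from the examples in this section.
EQUIVALENTS = [
    {"john kennedy", "jfk", "john f kennedy", "john fitzgerald kennedy"},
    {"cost of labor", "cost of labour"},
    {"bull market", "bullish market", "up market"},
]


def normalize(text: str) -> str:
    """Lowercase, drop periods, and collapse runs of whitespace."""
    text = text.lower().replace(".", "")
    return re.sub(r"\s+", " ", text).strip()


def expand_query(query: str) -> set[str]:
    """Return every listed equivalent of the query, or just the query itself."""
    key = normalize(query)
    for group in EQUIVALENTS:
        if key in group:
            return set(group)
    return {key}


print(sorted(expand_query("John F. Kennedy")))
print(sorted(expand_query("cost of labour")))
```

Normalizing before the lookup is what lets “John F. Kennedy” and “J.F.K.” reach the same group as “John F Kennedy” and “JFK” without listing every punctuation variant by hand.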
Which one is best: lemmatization or stemming?
In conclusion, developing a stemmer is far simpler than building a lemmatizer. The latter requires deep linguistic knowledge to create the dictionaries that allow the algorithm to look up the proper form of the word. Once this is done, noise is reduced and the results of the information retrieval process are more accurate.