Lemmatization vs Stemming

Almost all of us use a search engine in our daily work; it has become a key tool for getting our tasks done. However, the amount of data and resources available grows every minute, and providing high-quality results that match users’ queries becomes more complex.

One of the issues that complicates the process is ambiguous words. Terms of this type have different meanings depending on their function in the sentence. Let’s look at an example:

  • Let's take a five-minute break in this meeting.
  • This vase made of glass can break easily.

Both sentences use the word “break”, but with different meanings: in the first it acts as a noun and means a pause, while in the second it acts as a verb and means to shatter.

When we search large databases for precise information, ambiguous words can complicate the search: the results retrieved by the search engine will include documents containing the term “break” in both meanings, even though we may be interested in only one of them. Some results will be relevant to us, but the rest are just noise that slows down our work.
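The effect can be reproduced with the two example sentences. The following is a minimal sketch, assuming a deliberately naive tokenizer and exact keyword matching; the function names `tokenize` and `keyword_search` are hypothetical and do not come from any particular search engine.

```python
import re

# Toy document collection: the two example sentences about "break".
documents = {
    1: "Let's take a five-minute break in this meeting.",
    2: "This vase made of glass can break easily.",
}

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or apostrophe."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_search(term, docs):
    """Return the ids of all documents that contain the exact surface form."""
    return sorted(doc_id for doc_id, text in docs.items() if term in tokenize(text))

# The query cannot say whether the noun or the verb is meant,
# so both documents come back.
hits = keyword_search("break", documents)
print(hits)  # [1, 2]
```

A user interested only in the noun sense still receives the sentence about the vase, which is the noise described above.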

Ambiguity might not be the biggest problem when searching in English, but it plays a major role in highly inflected languages such as French, Spanish or Polish. These languages commonly use declension, that is, inflection of nouns, pronouns and adjectives.

How does inflection affect search?

When a term is entered into the search engine, it needs to be normalized, both at query time and at index time, so that what the user is looking for can match a term contained in the database. There are two different approaches to normalizing a word:

  • Lemmatization: based on how the word is used, the machine looks up its appropriate dictionary form (the lemma).
  • Stemming: characters are removed from the end of the word by following language-specific rules.
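To make the contrast concrete, here is a minimal sketch of both approaches. Everything in it is an assumption made for illustration: the suffix list is a toy rule set (not Porter, Snowball or any production stemmer), and `LEMMA_DICT` is a hand-written mini-dictionary standing in for a real lexicon plus part-of-speech tagger.

```python
# Hypothetical suffix rules for a toy English stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary keyed by (surface form, part of speech).
LEMMA_DICT = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("breaks", "NOUN"): "break",
    ("breaks", "VERB"): "break",
    ("broke", "VERB"): "break",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form of a word used with the given part of speech."""
    return LEMMA_DICT.get((word.lower(), pos), word.lower())

# The stemmer sees only characters: the noun "meeting" collapses to "meet".
print(toy_stem("meeting"))               # meet
# The lemmatizer uses the word's role in the sentence.
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
# Irregular forms have no suffix to strip, but a dictionary can resolve them.
print(toy_stem("broke"))                 # broke
print(toy_lemmatize("broke", "VERB"))    # break
```

The trade-off visible even in this toy version is that the stemmer needs no linguistic resources, while the lemmatizer depends on a dictionary and on knowing how the word is used.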

In weakly inflected languages, the method chosen may not influence the quality of the results. But our internal research has shown that for highly inflected languages the chosen process determines the accuracy of the end results.

The main advantage of lemmatization is that it takes the context of the word into account to determine which meaning the user intends. This reduces noise and speeds up the user’s task.

Here is an example of what happens when we search for an ambiguous word in French. As you can see, noise is much higher if we follow the stemming approach.

In most cases, when an ambiguous form comes from two different words, the stem is the same for both, whereas the lemmas show the difference.

[Image: lemma fraces.png]
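The same behaviour can be sketched in code. The example below is illustrative only and is not taken from the figure or the benchmark: it uses the French form “portes”, which can be the plural of the noun “porte” (door) or a form of the verb “porter” (to carry, to wear). The part-of-speech tags are written by hand, the suffix rules are a toy assumption, and the names `DOCS`, `LEMMAS`, `toy_stem`, `lemma` and `build_index` are hypothetical.

```python
# Each document is a list of (surface form, part of speech) pairs.
# Tags are hand-written here; a real pipeline would obtain them from a tagger.
DOCS = {
    "doc_doors": [("les", "DET"), ("portes", "NOUN"), ("sont", "VERB"), ("fermées", "ADJ")],
    "doc_wear": [("tu", "PRON"), ("portes", "VERB"), ("un", "DET"), ("manteau", "NOUN")],
}

# Hypothetical lemma dictionary covering only the ambiguous form.
LEMMAS = {("portes", "NOUN"): "porte", ("portes", "VERB"): "porter"}

def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least three characters."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemma(word, pos):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

def build_index(docs, normalize):
    """Map each normalized term to the set of documents that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for word, pos in tokens:
            index.setdefault(normalize(word, pos), set()).add(doc_id)
    return index

# The same normalizer is applied at index time and at query time.
stem_index = build_index(DOCS, lambda word, pos: toy_stem(word))
lemma_index = build_index(DOCS, lemma)

# The user is looking for the noun "porte" (door).
stem_hits = sorted(stem_index.get(toy_stem("porte"), set()))
lemma_hits = sorted(lemma_index.get(lemma("porte", "NOUN"), set()))

print(stem_hits)   # ['doc_doors', 'doc_wear']
print(lemma_hits)  # ['doc_doors']
```

With stemming, both forms collapse to the stem “port”, so the sentence about wearing a coat is returned as noise. With lemmatization, the noun and the verb are indexed under different lemmas and only the relevant document matches.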

If you are interested in additional multilingual examples, download our benchmark!

 
