Lemmatization for information retrieval

In recent years, the size of available data and information collections has been growing exponentially, which makes it necessary to adapt the tools we use so that we can access information more efficiently.

Information retrieval methods have been developed and refined to help users find the right information and to present it to them in an understandable way.

Internet usage keeps increasing all over the world, and with it the access to tremendous volumes of information retrievable at any given time. However, this means that both relevant and nonrelevant data will be retrieved, slowing down the retrieval process.

Two aspects are essential in the retrieval process: speed and relevance. To achieve optimal results, various mechanisms developed over the past years can be applied: Boolean models, vector space models, stemming, and lemmatization techniques.

  • Boolean models: the most common exact-match model
  • Vector space models: an algebraic model for representing text documents as vectors of identifiers
  • Stemming: the process of reducing inflected words to their root form
  • Lemmatization techniques: the algorithmic process of determining the lemma of a word based on its intended meaning
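To make the first two models concrete, here is a minimal Python sketch. The toy documents, the whitespace tokenizer, and the function names are assumptions made for illustration only; a real system would use proper tokenization and term weighting such as TF-IDF.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = {
    "d1": "the cat dived into the pool",
    "d2": "cats like diving",
    "d3": "the dog sleeps",
}

def tokenize(text):
    # Simplistic tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def boolean_and(query, docs):
    # Boolean (exact-match) model: a document matches only if it
    # contains every query term. There is no ranking.
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))

def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_rank(query, docs):
    # Vector space model: query and documents are term-count vectors,
    # and documents are ordered by their similarity to the query.
    q = Counter(tokenize(query))
    scores = {d: cosine(q, Counter(tokenize(text))) for d, text in docs.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])

print(boolean_and("the cat", docs))   # ['d1']
print(vector_rank("cat dog", docs))   # ['d3', 'd1']
```

Note that neither model matches "cats" in d2 against the query term "cat". Closing that gap is what stemming and lemmatization are for.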

During a retrieval process, the user types a query describing what they are looking for, and the system selects the relevant keywords from that request. The system then compares the keywords with the documents; when similarities are found, the document is retrieved and then compared against the other retrieved documents for ranking purposes.
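The flow described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the stopword list is a tiny hypothetical one, and ranking by the number of shared keywords stands in for whatever scoring a real engine would use.

```python
# Hypothetical stopword list; real systems use larger, language-specific ones.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "how", "to", "is", "i"}

def extract_keywords(query):
    # Keep only the words that carry meaning for the search.
    return [w for w in query.lower().split() if w not in STOPWORDS]

def retrieve(query, docs):
    keywords = set(extract_keywords(query))
    hits = []
    for doc_id, text in docs.items():
        overlap = keywords & set(text.lower().split())
        if overlap:
            # The document shares at least one keyword: retrieve it.
            hits.append((len(overlap), doc_id))
    # Rank retrieved documents against each other:
    # most shared keywords first, ties broken by document id.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits]

docs = {
    "d1": "diving lessons for beginners",
    "d2": "scuba diving equipment and diving lessons",
    "d3": "cooking lessons",
}

print(retrieve("how to find scuba diving lessons", docs))  # ['d2', 'd1', 'd3']
```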

Of the techniques introduced above, stemming and lemmatization are the best candidates for improving language models with respect to both speed and relevance.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, they differ in their approach.

Differences between lemmatization and stemming

Stemming algorithms work by cutting off the end of the word, and in some cases also the beginning, while looking for the root. This indiscriminate cutting is successful on some occasions but not always, which is why this approach has limitations.
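A deliberately naive suffix stripper shows both the idea and its limits. The suffix list and the minimum stem length below are hypothetical choices for this sketch; production stemmers such as the Porter stemmer apply far more elaborate rules.

```python
# Hypothetical, deliberately tiny rule set (not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    # Strip the first matching suffix, as long as at least
    # three characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["cats", "dived", "diving", "dive", "news"]:
    print(w, "->", naive_stem(w))
# cats -> cat
# dived -> div
# diving -> div
# dive -> dive
# news -> new
```

"cats" is handled well, but "dived" and "diving" end up as "div" while "dive" stays "dive", so the three forms no longer share one stem, and "news" is wrongly merged with "new".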

Lemmatization takes the morphological analysis of words into consideration. To do so, it needs detailed dictionaries that the algorithm can consult to link a form back to its lemma. With lemmatization turned on, a search term is reduced to its "lemma," and the inflected forms of the word are retrieved as well.

For example:

  • lemma -> to dive; inflected forms: dive, dived, diving
  • lemma -> cat; inflected form: cats
  • lemma -> hermano; inflected form: hermanas

In lemmatization, the part of speech and the context of a word determine its base form, or lemma.
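A dictionary-based lookup keyed on both the word form and its part of speech illustrates the principle. The dictionary entries and tag names below are a hypothetical toy example; real lemmatizers rely on large lexicons and on a tagger to determine the part of speech from context.

```python
# Hypothetical toy dictionary: (surface form, part of speech) -> lemma.
LEMMA_DICT = {
    ("dived", "VERB"): "dive",
    ("diving", "VERB"): "dive",
    ("diving", "NOUN"): "diving",
    ("cats", "NOUN"): "cat",
    ("hermanas", "NOUN"): "hermano",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when the form is not in the dictionary.
    return LEMMA_DICT.get((word.lower(), pos), word.lower())

print(lemmatize("dived", "VERB"))     # dive
print(lemmatize("hermanas", "NOUN"))  # hermano
print(lemmatize("saw", "VERB"))       # see
print(lemmatize("saw", "NOUN"))       # saw
```

The same form "saw" maps to two different lemmas depending on its part of speech, which is something suffix stripping alone cannot decide.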

Information retrieval systems and search engines often use lemmatization to gain a better understanding of a user’s query and serve the most relevant result.

When these techniques are used, the number of index entries is reduced, because the system uses a single entry to represent similar words that share the same root or stem.

The aim of storing an index is to optimize speed and relevance when finding the key documents for a search query. Without an index, the information retrieval system or search engine would have to scan every document in the corpus, which would require considerable time and computing power.
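The effect on the index can be illustrated with a small inverted index built with and without normalization. The lemma table and documents are hypothetical, and the in-memory dictionary only stands in for a real index structure.

```python
from collections import defaultdict

# Hypothetical lemma table used as the normalizer in this sketch.
LEMMAS = {"cats": "cat", "dived": "dive", "diving": "dive"}

def lemma_of(token):
    return LEMMAS.get(token, token)

def build_index(docs, normalizer=lambda t: t):
    # Inverted index: each term maps to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalizer(token)].add(doc_id)
    return dict(index)

docs = {
    "d1": "the cat dived",
    "d2": "cats love diving",
    "d3": "a dive for the cat",
}

raw = build_index(docs)
lemmatized = build_index(docs, lemma_of)

print(len(raw), len(lemmatized))           # 9 6
print(sorted(raw.get("dive", set())))      # ['d3']
print(sorted(lemmatized["dive"]))          # ['d1', 'd2', 'd3']
```

With normalization the index holds fewer terms, and a lookup for "dive" goes straight to all three documents instead of scanning the corpus or missing the inflected forms.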
