Although Machine Learning algorithms have been around since the mid-20th century, this technology, along with Deep Learning, is the new kid in town, and with good reason. Thanks to recent advances in computing power and data availability, both are increasingly used to perform astonishing tasks.
For example, we can now classify CT brain images to support early diagnosis of Alzheimer's disease, build recommendation systems based on users' behavior, or use neural networks to improve machine translation. This is why ML and DL have become such a strong trend in tech.
However, in general and in NLP in particular, only a few can afford end-to-end DL solutions, that is, solutions that put a single neural network in charge of the entire job. Usually, the input to the DL system is preprocessed data. But why do we need this preliminary step? And how is this preprocessing done?
Sometimes, before jumping into building our model, it's worth looking at our data and applying some transformations. This may look like tampering with reality, but it is nothing of the sort: it is done so that the DL solution that comes next achieves better results.
However fast machines may learn, there is often not enough data to learn some key features of natural languages from scratch. And learning them that way is certainly not within everyone's reach, given the high computational cost.
Today we are going to talk about two transformations aimed at reducing the noise and resolving the ambiguities that are typical of all natural languages.
How is preprocessing done?
When you're doing NLP, it's very likely that your data contains values that look different but can be treated as if they were the same. For example, "buy", "buys", "buying" and "bought" all carry the same semantic value (or meaning), and you will want to treat them equally in a task such as sentiment analysis.
For this kind of preprocessing you need lemmatization, which provides the lemma of every word. For all the examples cited above, it will return the lemma "buy". Note that lemmatization isn't the same as stemming. The latter lacks linguistic knowledge and can fail to recognize the lemma: it will return "buy" for every example except "bought", which has a different stem.
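The difference between the two approaches can be illustrated with a minimal sketch. Everything in it is a toy assumption made for illustration: `LEMMA_TABLE` is a hand-written lookup standing in for a real lexicon, and `toy_stem` and `toy_lemmatize` are hypothetical helpers, not the API of any particular library.

```python
# Toy lookup table standing in for a real lexicon (assumption: hand-written,
# covers only the examples from the text).
LEMMA_TABLE = {"buys": "buy", "buying": "buy", "bought": "buy"}

# Suffixes the toy stemmer knows how to strip (assumption: illustrative only).
SUFFIXES = ("ing", "s")


def toy_stem(word):
    """Strip a known suffix without any linguistic knowledge."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lexicon; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


words = ["buy", "buys", "buying", "bought"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['buy', 'buy', 'buy', 'bought']
print(lemmas)  # ['buy', 'buy', 'buy', 'buy']
```

The stemmer only sees characters, so the irregular form "bought" slips through unchanged, while the lookup-based lemmatizer maps all four forms to "buy". A real lemmatizer relies on a far larger lexicon and usually on morphological rules as well.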
The opposite may also happen: inputs that look the same can be radically different in nature and use. This is the case of polysemous words, that is, words that have several meanings and that, unfortunately for engineers, exist in every natural language.
Take, for example, the word "but". As a conjunction, it signals that what follows in the sentence contrasts with what came before. As a preposition, on the other hand, it has the same meaning as "except". It would therefore be useful for our DL system to take these labels (such as "conjunction" or "preposition") into account for each string, so that it can tell them apart and perform better.
This is precisely what POS tagging, another linguistically based preprocessing task, does. "POS" stands for "part of speech", which is how linguists refer to the syntactic behavior of words, and which you may know as grammatical categories. They are very useful in linguistics and widely used to distinguish one word from another.
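As a rough sketch of how such labels can help a downstream model, the snippet below attaches a tag to each token so that the two uses of "but" become distinct input features. The tags are assigned by hand for illustration (in practice they would come from a POS tagger), and the tag names and the `to_features` helper are hypothetical.

```python
# Hand-tagged sentences (assumption: a real POS tagger would produce these
# labels; the tag names are illustrative, not a standard tagset).
tagged_sentences = [
    [("I", "PRON"), ("tried", "VERB"), ("but", "CONJ"), ("failed", "VERB")],
    [("everyone", "PRON"), ("but", "PREP"), ("me", "PRON"), ("left", "VERB")],
]


def to_features(tagged_sentence):
    """Join each lowercased word with its tag so identical strings
    with different parts of speech become different features."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_sentence]


features = [to_features(sentence) for sentence in tagged_sentences]

# Collect every distinct feature derived from the string "but".
but_variants = {
    feature
    for sentence in features
    for feature in sentence
    if feature.startswith("but_")
}

print(features[0])           # ['i_PRON', 'tried_VERB', 'but_CONJ', 'failed_VERB']
print(sorted(but_variants))  # ['but_CONJ', 'but_PREP']
```

Without the tags, both sentences would feed the same token "but" to the model. With them, the conjunction and the preposition are kept apart before the DL system ever sees the data.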
If you want to know more about how these technologies work, see more examples and learn how they can improve the results of DL solutions, download our whitepaper on the topic below. Both lemmatization and POS tagging services are also available on our Deep Linguistic Analysis Platform.