You may have come across the abbreviation NLP lately, but even though it is often mentioned alongside terms like Big Data or Data Discovery, you may not know what it stands for, or that you are using it daily.
NLP, or Natural Language Processing, allows computers and machines to analyze, understand, and infer knowledge from natural language by combining artificial intelligence, computational linguistics, and computer science. Until recently, that was something only humans were able to do.
The idea of this post is to get you familiar with this technology and to show how Natural Language Processing and Computational Linguistics can be helpful for business purposes.
To understand how a machine can do a task that until not so long ago was done only by humans, we should start by understanding how natural language works: how it is composed and what challenges it poses.
Words are the basis of language, so they are the first item to consider in our analysis. Each word has a linguistic category: it can be a noun, an adjective, a verb, etc., and it can have attributes like singular, plural, masculine, feminine, indicative, subjunctive… How can the machine know the nature of each word?
There are two different methodologies: using machine learning algorithms or applying linguistics.
The first approach is widely used in the text analytics market. With machine learning, the idea is to provide the machine with thousands of examples and “teach” it the nature of each word. However, as we explained in a previous post, this approach is in general blind to context and has difficulty dealing with ambiguity.
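To make the "learning from examples" idea concrete, here is a minimal sketch in Python, not a production method. It assumes a tiny hand-made training set and a deliberately simple model that assigns each word the tag it received most often in the examples. The data, tag names, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training examples: sentences as (word, tag) pairs.
TRAINING = [
    [("the", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("a", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("they", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
]

def train_unigram_tagger(sentences):
    """Learn, for each word, the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag each word independently; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram_tagger(TRAINING)
# "book" is used as a verb here, but the model has no notion of context.
result = tag(model, ["they", "book", "a", "flight"])
```

Because "book" appears as a noun in two of the three examples, this toy model tags it as a noun everywhere, including in "they book a flight". Real machine learning taggers are far more sophisticated, but the sketch shows where context blindness comes from when each word is treated in isolation.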
Therefore, there is another approach in the market that uses linguistics to provide more accurate results, relying on effective tools to deal with ambiguity: lexicons, grammars, ontologies... How? By using POS (part-of-speech) tagging, which makes it possible to analyze each word in the context of the text it appears in.
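As an illustration of how a lexicon plus context can resolve ambiguity, here is a minimal sketch in Python. It is not Bitext's implementation: the lexicon entries, the two context rules, and the function name are assumptions made up for this example.

```python
# Hypothetical lexicon: each word lists every category it can belong to.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "they": {"PRON"},
    "book": {"NOUN", "VERB"},  # ambiguous without context
    "flight": {"NOUN"},
    "fell": {"VERB"},
}

def disambiguate(words):
    """Pick one tag per word, using the previous tag to resolve ambiguity."""
    tags = []
    for i, word in enumerate(words):
        candidates = LEXICON.get(word.lower(), {"NOUN"})
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        prev = tags[i - 1] if i > 0 else None
        if prev == "DET" and "NOUN" in candidates:
            # Toy rule: after a determiner, prefer the noun reading.
            tags.append("NOUN")
        elif prev in ("PRON", None) and "VERB" in candidates:
            # Toy rule: after a pronoun or sentence-initially, prefer the verb.
            tags.append("VERB")
        else:
            # Fallback: deterministic choice among remaining candidates.
            tags.append(sorted(candidates)[0])
    return list(zip(words, tags))

noun_reading = disambiguate(["the", "book", "fell"])
verb_reading = disambiguate(["they", "book", "a", "flight"])
```

With these rules, "book" comes out as a noun in "the book fell" and as a verb in "they book a flight", because the decision depends on the neighboring word rather than on the word alone.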
Analyzing words and understanding their nature is not enough. People write using structures so that the information can be understood by other people, so if we want machines to do human tasks they should also understand phrase and sentence structure.
Syntax allows us to understand phrase structure. How? By telling us how each word relates to the others inside a phrase and what role each phrase plays at the sentence level. This analysis is performed by a tool called a parser.
For those who may not know what a parser is, let’s briefly explain: in computational linguistics, a parser is a computer program that performs a formal analysis of a phrase and produces a parse tree showing the syntactic relations between words and sometimes even semantic information. At Bitext we have developed our own, and it is the cornerstone of our technology: a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser.
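To show what a parse tree looks like, here is a minimal sketch in Python. It is a toy top-down backtracking parser, not a GLR parser and not Bitext's technology. The three-rule grammar, the tag names, and the function names are assumptions made for illustration, and the input is assumed to be already POS-tagged.

```python
# Hypothetical toy grammar: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "NOUN"], ["PRON"]],
    "VP": [["VERB", "NP"], ["VERB"]],
}

def parse(symbol, tokens, start):
    """Yield (tree, next_position) for every way symbol matches from start."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must equal the POS tag of the current token.
        if start < len(tokens) and tokens[start][1] == symbol:
            yield (symbol, tokens[start][0]), start + 1
        return
    for rule in GRAMMAR[symbol]:
        yield from expand(symbol, rule, tokens, start, [])

def expand(symbol, rule, tokens, pos, children):
    """Match the remaining symbols of one rule, trying every alternative."""
    if not rule:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rule[0], tokens, pos):
        yield from expand(symbol, rule[1:], tokens, next_pos, children + [child])

def parse_sentence(tokens):
    """Return every parse tree that covers the whole sentence."""
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

tokens = [("they", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")]
trees = parse_sentence(tokens)
```

The result is a nested structure in which the sentence splits into a noun phrase ("they") and a verb phrase ("book a flight"), and the verb phrase in turn contains the noun phrase "a flight". That nesting is the syntactic relation between words that the parse tree makes explicit.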
In conclusion, NLP is a growing field that can be useful to many industries and to different departments inside a company because of the variety of tasks it can perform, such as sentiment analysis, entity extraction, topic segmentation or text categorization. However, it still has some challenges to overcome, particularly regarding accuracy. In a fast-changing environment, it is not only the speed of analysis that matters but also its accuracy, which is why at Bitext we trust linguistics to achieve 90% precision.
If you are getting started in this field, download our Python scripts for data analysis: information retrieval, text normalization and text processing!