Two concepts, one mission: making machines understand humans. Natural Language Processing (NLP) and Machine Learning (ML) are all the rage right now, but people tend to mix them up. This post draws the distinction between these two different but complementary terms in the field of Artificial Intelligence.
Table of Contents: NLP vs ML
- What is Natural Language Processing (NLP)?
- What is Machine Learning (ML)?
- How Bitext Enhances Machine Learning through NLP
What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is the subfield of computer science that enables computer systems to understand human language as humans naturally speak and type it. Humans use language freely, often with abbreviations, misspellings, slang... These variations make human language harder for computers to analyze. However, NLP and Machine Learning (ML) have lately been making great progress toward solving these issues. Bitext brings a unique approach to the Natural Language market. As experts in computational linguistics, we continuously develop new tools designed to boost accuracy when machines read and understand human utterances.
Understanding natural language involves different factors that must be considered:
- Semantics is the branch of linguistics dealing with the meaning of words. A sentence like ‘The bear painted a picture of the landscape’ is anomalous because of the meaning of the verb ‘to paint’, an action typically performed by a human being. To realize that the sentence makes no sense, one must know the actual definition of the word ‘paint’ here.
- Syntax. Text structure is important to grasp what is meant, and even more important is the structure of individual sentences, also known as syntax. Take a sentence such as ‘Laura joined the team having some experience’. Who exactly has the experience: Laura, or the team? Here, what Laura brings to the team depends on the reader’s interpretation.
- Context. The last two factors are essential for good understanding, but sometimes there are external aspects you must grasp to know what a text is about. This is where context comes in: the setting in which a sentence or word appears. If someone says, ‘that’s wicked!’, is the meaning behind it positive or negative? One must know the context in which it was uttered to tell.
Let’s see everything together in the sentence mentioned above: ‘The bear painted a picture of the landscape’:
- Semantics: ‘bear’ – an animal; ‘painted’ – the act of representing by or as if by a picture; ‘picture’ – a design or representation made by various means; ‘landscape’ – a portion of territory.
- Syntax: subject – action – direct object.
- Context: this sentence is about a bear painting a picture.
None of these three is enough on its own; it is their combination that yields a full understanding of the sentence.
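As a toy illustration of the semantic factor, one can encode selectional restrictions: which kind of agent a verb requires. The following sketch is purely illustrative (the tiny lexicons and the function name `violates_semantics` are assumptions for this example, not part of any real system):

```python
# Toy semantic check: flag sentences whose verb requires a human agent
# but whose subject denotes something else. Lexicons are illustrative only.
AGENT_REQUIREMENTS = {"painted": "human"}            # verb -> required agent type
NOUN_TYPES = {"bear": "animal", "Laura": "human"}    # noun -> semantic type

def violates_semantics(subject, verb):
    """Return True if the verb's required agent type conflicts with the subject."""
    required = AGENT_REQUIREMENTS.get(verb)
    actual = NOUN_TYPES.get(subject)
    return required is not None and actual is not None and actual != required

print(violates_semantics("bear", "painted"))   # True: bears don't paint
print(violates_semantics("Laura", "painted"))  # False: humans can paint
```

A real system would of course need a far richer type hierarchy and would combine this check with syntactic and contextual analysis, as described above.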
When getting down to work, this sentence can be analyzed using machine learning for text analytics. However, the results are not good enough: ML models are quite helpful for detecting entities and sentiment in a document, but there is room for improvement when extracting topics or themes. Nevertheless, as Natural Language Processing (NLP) and Machine Learning (ML) techniques have evolved over the years, such issues are progressively being addressed. At Bitext, therefore, we use a hybrid approach which greatly increases the accuracy of the results obtained.
What is Machine Learning (ML)?
In text analytics, machine learning is a combination of statistical techniques that serve to detect patterns such as sentiment, entities, parts of speech and other phenomena within a text. There are two kinds of machine learning procedures: supervised and unsupervised. In supervised machine learning, the techniques are expressed as a model that can then be applied to other texts; unsupervised machine learning instead runs algorithms across extensive data sets to extract meaning. Knowing the differences between supervised and unsupervised learning, and how to get the best of both in one system, is essential to getting the best results:
- Supervised Machine Learning- On the one hand, supervised machine learning relies on a large set of manually tagged documents to find patterns in text. This set of tagged documents is used to train statistical models that are afterward applied to new texts. The bigger the data set, the better the results: every model can be trained multiple times to enhance its learning. Deep Learning, for instance, is typically applied as a supervised machine learning technique, and it is what the Bitext platform is based on. The key differentiator here is that the machine learning techniques are ‘guided’ to some extent.
- Unsupervised Machine Learning- On the other hand, the statistical models used to extract meaning from texts in unsupervised machine learning do not require any pre-tagged sets. ‘Clustering’ is one such unsupervised ML technique: gathering documents together into clusters through a hierarchical relationship. Another unsupervised technique, ‘latent semantic indexing’, identifies words that often appear together in texts. This technique can be used for multifaceted document search: if the words ‘TV’ and ‘channel’ are related in many texts, a search for ‘TV’ will very likely also return documents containing ‘channel’.
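The ‘TV’/‘channel’ idea can be sketched with simple co-occurrence counting, a crude stand-in for latent semantic indexing. Everything here (the tiny corpus, the stopword list, the `expand_query` helper) is a made-up illustration, not a production technique:

```python
from collections import Counter
from itertools import combinations

# Count which word pairs co-occur in the same document, then expand
# a query term with its most frequent co-occurring neighbour.
docs = [
    "the tv channel shows news",
    "change the tv channel",
    "the tv remote is lost",
    "radio channel frequencies",
]
STOPWORDS = {"the", "is"}

pair_counts = Counter()
for doc in docs:
    words = set(doc.split()) - STOPWORDS
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

def expand_query(term):
    """Return the word that most often co-occurs with `term`."""
    best, best_count = None, 0
    for (a, b), count in pair_counts.items():
        if term in (a, b) and count > best_count:
            best, best_count = (b if a == term else a), count
    return best

print(expand_query("tv"))  # 'channel' co-occurs with 'tv' most often
```

Real latent semantic indexing works on a term-document matrix with dimensionality reduction, but the intuition is the same: frequently co-occurring words end up associated.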
How Bitext Enhances Machine Learning through NLP
- Tokenization- Tokenization is the natural language processing task of splitting text into tokens, often implemented with regular expressions. In English, for instance, it can be considered a fairly easy task thanks to the spaces between words. However, in Mandarin Chinese, where there are no spaces, an adequate algorithm and high-quality dictionaries are crucial for identifying rules and patterns.
- POS-Tagging- Part-of-Speech tagging is used for several NLP tasks such as topic or entity extraction. At Bitext, our NLP models are built to tag parts of speech with up to 90% accuracy, even for slang and the language variants used in social media.
- Entity Extraction- Our natural language processing model, used for the extraction of named entities in a text, is able to recognize up to 15 different entity types: people, places, phone numbers, email addresses, companies, URLs, money, Twitter users… and much more. We must not forget that the Entity Extraction tool relies on prior POS-tagging as an input feature.
- Sentiment Analysis- Bitext’s Sentiment Analysis tool is topic-based and identifies opinions and emotions regardless of the source: surveys, reviews, searches, conversations, etc. It analyzes opinions found in texts to detect an emotional response from users in more than 20 languages. First, it identifies the topic(s) being discussed in a particular text; then it evaluates the opinion(s) expressed about each topic and their polarity (positive/negative).
- Categorization- Our categorization service classifies texts into groups according to customized categories. Our team of computational linguists creates rules based on linguistic analysis, so the accuracy achieved is more stable. Take a look at our case study for the automotive industry to get a better idea.
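To make tokenization concrete, here is a minimal regex tokenizer for a space-delimited language such as English. It is a sketch only; real tokenizers also handle clitics, abbreviations, numbers and emoji:

```python
import re

# Match runs of word characters, or any single non-space punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

For a language like Mandarin Chinese, as noted above, a pattern like this is not enough: segmentation needs dictionaries and a disambiguation algorithm.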
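The simplest possible POS tagger is a dictionary lookup. The tag set and lexicon below are illustrative toys, not Bitext’s actual models, but they show the input/output shape of the task:

```python
# Toy dictionary-based part-of-speech tagger.
LEXICON = {
    "the": "DET", "a": "DET",
    "bear": "NOUN", "picture": "NOUN", "landscape": "NOUN",
    "painted": "VERB",
    "of": "ADP",
}

def pos_tag(tokens):
    """Tag each token from the lexicon, defaulting to 'NOUN' for unknown words."""
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag(["The", "bear", "painted", "a", "picture"]))
# [('The', 'DET'), ('bear', 'NOUN'), ('painted', 'VERB'), ('a', 'DET'), ('picture', 'NOUN')]
```

Real taggers use statistical or neural models to resolve ambiguity (e.g. ‘paint’ as noun vs. verb) from context, which is where the accuracy figures quoted above come from.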
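Some of the entity types listed above (email addresses, URLs, Twitter users) can be approximated with regular expressions. These patterns are deliberately simplified sketches; a production extractor layers POS tags and context on top:

```python
import re

ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
    # (?<!\S) keeps '@example' inside an email from matching as a handle
    "twitter_user": re.compile(r"(?<!\S)@\w+"),
}

def extract_entities(text):
    """Return a dict mapping entity type to the matches found in `text`."""
    return {name: pat.findall(text) for name, pat in ENTITY_PATTERNS.items()}

found = extract_entities("Email bob@example.com or ping @bitext, see https://bitext.com")
print(found["email"])         # ['bob@example.com']
print(found["twitter_user"])  # ['@bitext']
print(found["url"])           # ['https://bitext.com']
```

Entities like people, places and companies cannot be captured by patterns alone, which is why entity extraction depends on prior POS-tagging, as noted above.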
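The topic-based sentiment idea can be sketched as: find a topic word, then score opinion words in a window around it. The lexicons, topics and window size here are illustrative assumptions, not the actual Bitext tool:

```python
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"terrible", "slow", "hate"}
TOPICS = {"battery", "screen", "service"}

def topic_sentiment(text):
    """Return {topic: polarity} for each topic word found in the text."""
    words = text.lower().split()
    results = {}
    for i, word in enumerate(words):
        if word in TOPICS:
            # Score opinion words in a small window around the topic mention.
            window = words[max(0, i - 2): i + 4]
            score = sum(w in POSITIVE for w in window) - sum(w in NEGATIVE for w in window)
            results[word] = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return results

print(topic_sentiment("The screen is great but the battery is terrible"))
# {'screen': 'positive', 'battery': 'negative'}
```

Note how the two topics in one sentence receive opposite polarities; that per-topic granularity is the point of topic-based sentiment analysis.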
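Finally, rule-based categorization of the kind described above can be sketched as keyword rules per category. The categories and keywords below are invented for illustration:

```python
# Toy rule-based text categorizer: hand-written keyword rules per category.
RULES = {
    "billing": {"invoice", "refund", "charge"},
    "technical": {"crash", "error", "bug"},
}

def categorize(text):
    """Return every category whose keyword rule matches a word in the text."""
    words = set(text.lower().split())
    return sorted(cat for cat, keywords in RULES.items() if words & keywords)

print(categorize("I found a bug and want a refund"))  # ['billing', 'technical']
```

Linguist-written rules like these behave predictably, which is what makes the accuracy of a rule-based categorizer more stable than a purely statistical one.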
Apart from these core elements, which greatly help natural language understanding (NLU), we cannot forget the natural language generation (NLG) process, which serves to train AI. At Bitext, we have developed a brand-new solution: artificial training data.
Bots built upon machine learning need long training processes before they can hold a meaningful conversation with real people. Training data is therefore a diamond in the rough; every company needs such input for its bots. Until now, this data was generated slowly and by hand. With artificially generated data, however, bot training can be sped up, reducing the cost and time spent producing data for Machine Learning training.
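One simple way to generate artificial training data is template expansion: fill slots in an utterance template with interchangeable values to produce many labelled variants. The template, slot values and `generate` helper below are hypothetical examples, not Bitext’s actual generation method:

```python
from itertools import product

TEMPLATE = "{greeting} I want to {action} my {item}"
SLOTS = {
    "greeting": ["Hi,", "Hello,"],
    "action": ["cancel", "renew"],
    "item": ["subscription", "order"],
}

def generate(template, slots):
    """Yield every combination of slot values filled into the template."""
    names = list(slots)
    for values in product(*(slots[n] for n in names)):
        yield template.format(**dict(zip(names, values)))

variants = list(generate(TEMPLATE, SLOTS))
print(len(variants))   # 2 * 2 * 2 = 8 utterances
print(variants[0])     # 'Hi, I want to cancel my subscription'
```

Even this tiny example turns one hand-written pattern into eight training utterances; linguistically informed generation scales the same idea to thousands of grammatical variants per intent.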