Text analysis is becoming a pervasive task in many business areas. Machine Learning, which is based on statistical and mathematical models, is the most common approach to it. Linguistic approaches, which are based on knowledge of language and its structure, are used far less frequently. The two are often seen as alternatives, or even as competitors.
This view is a major obstacle to the progress of the Big Data industry, where text represents a large percentage of the data. The two approaches are in fact complementary: when properly combined, they provide the most effective way of extracting high-quality insights from big data.
The misconception that these two approaches compete predominates in the industry. We disagree: machine learning and linguistic approaches can work together, and in fact they should. Linguistic approaches are ideal for understanding language and giving text a structure; machine learning cannot understand this structure, but needs it to extract accurate insights from text data. So each discipline has a “sweet spot”.
Linguistic Analysis is in a better position to extract structure from text. Machine Learning typically handles text in a “naïve” way, as a flat set of strings with no word order (using different versions of the classical “bag of words” approach), so sentences like “dog bites man” and “man bites dog” look the same. This limits the amount of information that Machine Learning can extract. Deep Linguistic Analysis, by contrast, is based on knowledge about language (grammars, ontologies and dictionaries) and can handle the structure of language at all levels (morphology, syntax and semantics). By taking this structure into account, it handles phenomena like negation (“I never liked it”) and conditionality (“I'd like it if it were cheaper”) accurately, including complex cases where two sentences have similar wording but entirely different meanings (like “I don’t plan to buy this product” and “if I don’t buy this product today I can buy it tomorrow”). In short, Deep Linguistic Analysis is specifically designed to find the structure in (apparently) unstructured text.
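The “dog bites man” problem is easy to reproduce. The following is a minimal sketch in Python, assuming the simplest possible bag of words (lowercased, whitespace-separated tokens with their counts); the function name bag_of_words is ours, chosen only for illustration.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercased, whitespace-separated tokens; word order is discarded."""
    return Counter(sentence.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations compare equal.
print(first == second)
```

Real systems use richer variants (n-grams, weighting schemes and so on), but any representation that only counts isolated words ends up with the same blind spot.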
However, Machine Learning is in a better position to extract insights, provided that it works on text that has already been analyzed and structured rather than on raw, unstructured text. Linguistics on its own is not concerned with insight extraction.
We can take advantage of both strengths if we do things in the right order. First, Deep Linguistic Analysis generates a rich and accurate representation of the structure of the texts; second, Machine Learning takes this structure as its input features and extracts insights from it, which is the task that it naturally excels at.
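The two-step order can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a real system: analyze is a hand-written toy standing in for Deep Linguistic Analysis (it only spots a few negation words and the word “if”), the learner is a trivial majority vote standing in for a real Machine Learning model, and the labels “rejection” and “conditional” are invented for the example.

```python
from collections import Counter, defaultdict

# Toy word lists: a stand-in for real grammars and dictionaries (assumption).
NEGATORS = {"not", "never", "don't", "no"}
CONDITIONALS = {"if"}


def analyze(sentence: str) -> dict:
    """Step 1 (toy linguistic analysis): turn raw text into structured features."""
    tokens = sentence.lower().replace("’", "'").split()
    conditional = any(token in CONDITIONALS for token in tokens)
    negated = any(token in NEGATORS for token in tokens)
    return {
        "conditional": conditional,
        # Negation inside a conditional sentence is hypothetical, not asserted.
        "asserted_negation": negated and not conditional,
    }


def feature_key(sentence: str) -> tuple:
    """Make the feature dictionary hashable so it can index the model."""
    return tuple(sorted(analyze(sentence).items()))


def train(examples):
    """Step 2 (toy learner): majority label for each structured feature vector."""
    votes = defaultdict(Counter)
    for sentence, label in examples:
        votes[feature_key(sentence)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def predict(model, sentence, default="unknown"):
    """Look up the label learned for this sentence's structured features."""
    return model.get(feature_key(sentence), default)


model = train([
    ("I never liked it", "rejection"),
    ("I'd like it if it were cheaper", "conditional"),
])

print(predict(model, "I don't plan to buy this product"))
print(predict(model, "if I don't buy this product today I can buy it tomorrow"))
```

The learner never sees the words themselves, only the structure produced in the first step, so the two “buy this product” sentences receive different labels even though their wording is similar.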