This post expands on one of the topics from a previous post, "Machine Learning & Deep Linguistic Analysis in Text Analytics". There we discussed the strengths of Machine Learning technology for insight extraction. We also stated that relying on variations of the classical "bag of words" model limits the ability of Machine Learning to extract insights. Here we look at that last statement in more detail.
Most (if not all) commercial solutions for text analysis based on machine learning take a “bag of words” approach. Simply put, this means that all the words in a sentence (or paragraph or document) are put in a list or "bag", and the relationships between them are lost (*). The immediate consequence is that in a sentence like “Google acquired ACME” we lose the information about who is the acquirer and who is acquired, because the knowledge embedded in the sentence structure can no longer be exploited. Common preprocessing steps such as stemming add to the problem by "semantically" relating words that are unrelated, like "good" and "goods", or "new" and "news". These issues get worse in multilingual scenarios, where language morphology can be more complex.
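As a rough illustration of both effects, here is a minimal Python sketch. The `bag_of_words` helper and the `naive_stem` function are our own toy assumptions, not the internals of any commercial product; real stemmers are more elaborate, but they can conflate words in a similar way.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens and count them.

    Word order is discarded: only the counts survive.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def naive_stem(word):
    """Toy stemmer (an assumption for illustration): strip one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Two sentences with opposite meanings produce exactly the same bag.
same_bag = bag_of_words("Google acquired ACME") == bag_of_words("ACME acquired Google")
print(same_bag)  # True

# After stemming, unrelated words collapse onto the same stem.
stems = {word: naive_stem(word) for word in ["good", "goods", "new", "news"]}
print(stems)  # {'good': 'good', 'goods': 'good', 'new': 'new', 'news': 'new'}
```

Once the two sentences are reduced to the same bag, no downstream model can tell the acquirer from the acquired, however much training data it sees.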
Ignoring the structure of a sentence can lead to various types of analysis problems. The most common one is incorrectly assigning similarity to two unrelated phrases such as “Social Security in the Media” and “Security in Social Media” just because they use the same words (although with a different structure).
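The similarity problem can be reproduced with a few lines of Python. This is only a sketch under our own assumptions: it uses raw word counts and cosine similarity, one common way to compare bags of words, and the function names are ours.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens, ignoring their order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(bag_a[word] * bag_b[word] for word in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


phrase_1 = bag_of_words("Social Security in the Media")
phrase_2 = bag_of_words("Security in Social Media")
similarity = cosine_similarity(phrase_1, phrase_2)
print(round(similarity, 3))  # 0.894
```

The two phrases share four of their words and differ only in "the", so they score about 0.89 out of 1. If "the" were removed as a stop word, as many pipelines do, the two bags would be identical.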
Moreover, the approach is especially damaging for certain "special" words like "not" or "if". In a sentence like “I would recommend this phone if the screen was bigger”, there is no actual recommendation of the phone, yet many text analysis tools could report one, because the words "recommend" and "phone" are both present and the connection between "if" and "recommend" goes undetected.
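The following sketch shows how a rule that only looks at which words are present ends up treating the conditional sentence as a recommendation. The `naive_recommends` function is a hypothetical rule written for this post, not the logic of any specific tool.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in the text (order is lost)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def naive_recommends(text, product="phone"):
    """Hypothetical bag-of-words rule: fires when 'recommend' and the product co-occur."""
    bag = tokens(text)
    return "recommend" in bag and product in bag


conditional = "I would recommend this phone if the screen was bigger"
plain = "I would recommend this phone"

print(naive_recommends(plain))        # True
print(naive_recommends(conditional))  # True, although the recommendation is conditional
print("if" in tokens(conditional))    # True: the word is in the bag, but not its link to "recommend"
```

The word "if" is present in the bag, but nothing records that it governs "recommend", so both sentences get the same answer.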
One typical example in everyday business is the detection of the topic in sentiment analysis: in a sentence like "I did enjoy my new car in Madrid", it's very helpful for insight extraction to understand that the positive sentiment is about the new car, and not about Madrid. With a bag-of-words approach to machine learning this task becomes impossible in practice, because the bag keeps no link between the sentiment word and either of the candidate topics.
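To make the topic problem concrete, here is a small sketch in which the sentiment score is computed over the whole bag and then assigned to every candidate topic found in it. The word lists and the `topic_sentiment` function are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"enjoy", "love", "great"}
NEGATIVE_WORDS = {"hate", "awful", "bad"}
CANDIDATE_TOPICS = {"car", "madrid"}


def topic_sentiment(text):
    """Score the whole bag, then assign that score to every topic found in it."""
    bag = set(re.findall(r"[a-z]+", text.lower()))
    score = len(bag & POSITIVE_WORDS) - len(bag & NEGATIVE_WORDS)
    return {topic: score for topic in sorted(bag & CANDIDATE_TOPICS)}


result = topic_sentiment("I did enjoy my new car in Madrid")
print(result)  # {'car': 1, 'madrid': 1}
```

Both "car" and "madrid" receive the same positive score, since nothing in the bag says that "enjoy" applies to the car rather than to the city.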
If you want to know more about Machine Learning and its applications, download our benchmark!