This post dives into one of the topics of a previous post, "How to Make Machine Learning more effective using Linguistic Analysis". There, we described the strong points of Machine Learning technology for insight extraction. We also stated that text analysis is not the area where machine learning shines the most. Here we go into some detail on this last statement.
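Statistical techniques are good at analyzing highly complex phenomena that are hard to model because our knowledge of them is scarce. Two examples:
- the weather
- the stock markets
Language is different: we know a great deal about how sentences are built, and a purely statistical approach tends to throw that knowledge away.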
Most (if not all) commercial solutions for text analysis based on machine learning technology take a “bag of words” approach.
Simply put, this means that all words in a sentence (or paragraph or document) are put in a list or "bag", where the relationships between words are lost (*).
The immediate consequence is that in a sentence like “Google acquired ACME” we lose the information on who's the acquirer and who's acquired, because exploiting the knowledge embedded in the sentence structure becomes impossible.
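To make this concrete, here is a minimal sketch of a bag of words in Python. The `bag_of_words` helper is our own illustrative name (not taken from any particular product), and it assumes plain whitespace tokenization:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count the tokens.
    Word order (and therefore who did what to whom) is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("Google acquired ACME")
b = bag_of_words("ACME acquired Google")

# Both sentences collapse into the same bag, so the acquirer and the
# acquired company can no longer be told apart.
print(a == b)  # True
```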
Other strategies, such as stemming, can end up "semantically" relating words that are not related, like "good" and "goods", or "new" and "news". These issues get worse in multilingual scenarios, where language morphology can be more complex.
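The effect can be reproduced with a deliberately naive stemmer. The `naive_stem` function below is a toy written for this illustration, not the algorithm of any real stemming library; it only strips a trailing plural-looking "s":

```python
def naive_stem(word: str) -> str:
    """Toy stemmer (illustrative assumption): drop a trailing 's'
    from words longer than three letters, unless they end in 'ss'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

# Unrelated words end up sharing the same stem.
for left, right in [("good", "goods"), ("new", "news")]:
    print(left, right, naive_stem(left) == naive_stem(right))
```

Once two words share a stem, any downstream model treats them as the same feature, whatever they actually mean.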
Ignoring the structure of a sentence can lead to various types of analysis problems. The most common one is incorrectly assigning similarity to two unrelated phrases such as “Social Security in the Media” and “Security in Social Media” just because they use the same words (although with a different structure).
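As a rough sketch of how this happens, the snippet below computes the cosine similarity between the bags of words of the two phrases. The helper names and the two-word stopword list are assumptions made for this example only:

```python
import math
from collections import Counter

STOPWORDS = {"in", "the"}  # tiny stopword list, assumed for this example

def bag(text: str, drop_stopwords: bool = False) -> Counter:
    """Build a bag of words, optionally removing stopwords."""
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

p1 = "Social Security in the Media"
p2 = "Security in Social Media"

print(round(cosine(bag(p1), bag(p2)), 2))              # 0.89
print(round(cosine(bag(p1, True), bag(p2, True)), 2))  # 1.0
```

With stopwords removed, a common preprocessing step, the two phrases become indistinguishable even though they talk about different things.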
Besides, this approach is especially harmful for certain "special" words like "not" or "if". In a sentence like “I would recommend this phone if the screen was bigger”, we don't have a recommendation for the phone. Yet a recommendation could be the output of many text analysis tools, because the words "recommend" and "phone" are both present and the connection between "if" and "recommend" goes undetected.
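A keyword-based detector shows the problem. The `naive_recommends` function is a hypothetical, simplified stand-in for this kind of tool, not the logic of any specific product:

```python
def naive_recommends(text: str) -> bool:
    """Hypothetical keyword detector: reports a recommendation whenever
    both 'recommend' and 'phone' appear, ignoring words like 'if'."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return "recommend" in tokens and "phone" in tokens

# The conditional sentence is classified exactly like the plain one.
print(naive_recommends("I would recommend this phone if the screen was bigger"))  # True
print(naive_recommends("I would recommend this phone"))                           # True
```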
One typical example in everyday business is detecting the topic a sentiment refers to: in a sentence like "I did enjoy my new car in Madrid", it's very helpful for insight extraction to understand that the positive sentiment is about the new car, and not about Madrid. With a bag-of-words approach to machine learning, this task becomes impossible in practice.
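Without structure, about the best a tool can do is attach the sentiment to every topic that appears in the same sentence. The sketch below assumes a tiny hand-made lexicon and topic list, both hypothetical:

```python
POSITIVE = {"enjoy", "love", "great"}  # hypothetical sentiment lexicon
TOPICS = {"car", "madrid"}             # hypothetical topics of interest

def cooccurrence_sentiment(text: str) -> dict:
    """Give every topic in the sentence the sentence-level positive score,
    since a bag of words cannot tell which topic the sentiment targets."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(1 for t in tokens if t in POSITIVE)
    return {t: score for t in tokens if t in TOPICS}

print(cooccurrence_sentiment("I did enjoy my new car in Madrid"))
# {'car': 1, 'madrid': 1}
```

Both "car" and "madrid" receive the same positive score, although only the car is being praised.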
(*) Some solutions integrate statistical and linguistic knowledge, like the Stanford parser, covered in this post in our blog.