The ‘word embeddings’ approach has been widely adopted in machine learning. While extensive research has been carried out in recent years to analyze the theoretical underpinnings of algorithms such as word2vec, GloVe or fastText, surprisingly little has been done to solve some of the more complex linguistic issues that arise in practice.
In our previous post, ‘Main Challenges for Word Embeddings: Part I’, we discussed two main challenges that linguistic phenomena such as homographs and inflection pose for this approach. In today’s post, we look at some other problems that can be overcome with technology tools based on linguistics:
Antonyms, Phrases, Entities and Expressions
- Antonyms. While all the issues discussed in our previous post are problematic, this one is particularly risky. Because word embedding algorithms generate vectors based on the contexts in which words appear, they tend to generate similar vectors for ‘opposite’ words too, since antonyms often occur in the same contexts. It is not unusual, for instance, that love and adore share a similarity of 0.72. The problem comes when love and hate also show a close similarity of 0.62. This issue can have disastrous effects on text classification for sentiment analysis or on conversational agents. For example, a system that cannot properly distinguish between good and bad will not analyze user reviews correctly. Likewise, a home automation system that considers up and down to be the same will have many problems when controlling a thermostat.
Solution: Bitext technology solves this problem using lexical knowledge. The solution includes reliable identification of antonyms and synonyms during the preprocessing stage.
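To see why context-based vectors place antonyms close together, here is a minimal sketch in Python. It does not use word2vec or any Bitext tool: it builds plain co-occurrence count vectors from a made-up five-sentence corpus and compares them. The corpus, the function names and the tiny `ANTONYMS` lexicon are all assumptions made for illustration; a real system would rely on a curated lexical resource.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration only.
CORPUS = [
    "i love this movie",
    "i hate this movie",
    "i adore this movie",
    "we love that song",
    "we hate that song",
]

# Hypothetical antonym lexicon standing in for real lexical knowledge.
ANTONYMS = {frozenset({"love", "hate"})}


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_antonym_pair(word_a, word_b):
    """Lexical check applied alongside the vector comparison."""
    return frozenset({word_a, word_b}) in ANTONYMS


VECTORS = context_vectors(CORPUS)
LOVE_HATE = cosine(VECTORS["love"], VECTORS["hate"])
LOVE_ADORE = cosine(VECTORS["love"], VECTORS["adore"])

for pair, score in ((("love", "hate"), LOVE_HATE), (("love", "adore"), LOVE_ADORE)):
    flag = "antonyms" if is_antonym_pair(*pair) else "not flagged"
    print(pair, round(score, 2), flag)
```

In this toy corpus love and hate occur in exactly the same contexts, so their vectors are even closer than those of love and adore. The vectors alone cannot tell the two cases apart; the lexical check can.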
- Phrases, entities, and expressions. Although some words that are spelled alike can be distinguished by their part of speech, the issue of polysemy (same spelling and POS but different meaning) remains unsolved. A good illustration is the adjective social, whose meaning depends on the context, as in social security and social media. In token-based word embeddings, social media is not treated as a single token. It is therefore hard for an ML system to compare it to the word Twitter, for instance, even when the vectors for social and media are combined:
- media vs. Twitter: 0.32
- social + media vs. Twitter: 0.42
Solution: Train the word embedding model on a corpus where all noun phrases are marked as single tokens. In this case, a comparison between social_media_NounPhrase and Twitter_NOUN yields a similarity of 0.68. The Bitext approach helps not only with polysemy but also with expressions or verb phrases such as you’d better or I’d rather. The Bitext entity extraction tool also helps apply this solution to entities and other multiword expressions such as:
- Places: United States, Buenos Aires, New England, New York…
- Companies: Standard & Poor’s, Home Depot…
- Discourse markers: on the one hand, of course, by the way…
- Phrasal verbs: turn on, turn off…
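The single-token idea can be sketched as a small preprocessing step that runs before the corpus is passed to an embedding trainer. This is only an illustration in Python, not the Bitext implementation: the `PHRASES` table is a hypothetical, hand-written lexicon, whereas a real pipeline would get its phrases from a parser or an entity extraction tool. The `_NounPhrase` suffix follows the token naming used in the example above.

```python
# Hypothetical phrase lexicon: lowercased token sequences mapped to the
# single token that should replace them in the training corpus.
PHRASES = {
    ("social", "media"): "social_media_NounPhrase",
    ("social", "security"): "social_security_NounPhrase",
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASES)


def merge_phrases(tokens):
    """Replace known multiword phrases with one token, longest match first."""
    merged = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_PHRASE_LEN, len(tokens) - i)
        for n in range(longest, 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in PHRASES:
                merged.append(PHRASES[key])
                i += n
                break
        else:
            # No phrase starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_phrases("Twitter is a social media platform".split()))
print(merge_phrases("Social security is a public program".split()))
```

Once the corpus has been rewritten this way, social media and social security each get their own vector instead of sharing the vector for social.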
Linguistic knowledge enhances machine learning by applying linguistic solutions to raw data before it enters the learning pipeline. The study, the evaluation, and the results show that Bitext technology, based on a linguistic approach, can help a machine learning engine quickly reach higher understanding accuracy, not only in standardization but also in information extraction and topic detection.