Machine learning algorithms need large amounts of numeric data to work properly. Real people, however, do not speak to bots in numbers; they communicate in natural language. That is why chatbot developers need to convert words into numbers so that virtual assistants can understand what users are saying. This is where word embeddings come into play.
Word embeddings are numerical representations (in this case, vectors) of words. These vectors are not assigned at random; they are generated from the contexts in which each word is typically used. That is to say, words used in similar contexts are represented by similar vectors, so two words such as red and yellow will have vectors that lie close together. This property is quite useful for machine learning algorithms because it helps models generalize: if a model is trained on one word, words with similar vectors will be treated in a similar way.
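To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure used for the figures quoted later in this post. The three-dimensional vectors are invented for illustration and do not come from any trained model; real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, made up for this example only.
toy_vectors = {
    "red":    [0.9, 0.8, 0.1],
    "yellow": [0.8, 0.9, 0.2],
    "table":  [0.1, 0.2, 0.9],
}

sim_colors = cosine_similarity(toy_vectors["red"], toy_vectors["yellow"])
sim_mixed = cosine_similarity(toy_vectors["red"], toy_vectors["table"])

# The two color words point in nearly the same direction; "table" does not.
print(f"red vs yellow: {sim_colors:.2f}")
print(f"red vs table:  {sim_mixed:.2f}")
```

A value close to 1 means the two vectors point in almost the same direction, while a value near 0 means they have little in common.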
The quality of a word embedding model is often measured by its performance on word analogies. A good model should recognize that the word queen is related to the word king in the same way that the word woman is related to the word man. While these analogies may reflect the principles of distributional semantics, they are not a good illustration of how word embeddings perform in practical contexts. Some linguistic phenomena pose big challenges for word embeddings. Here we will explain two of these problems: homographs and inflection.
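The analogy test is commonly implemented with vector arithmetic: take the vector for king, subtract man, add woman, and look for the nearest remaining word. The sketch below shows that idea on hand-made toy vectors. The vectors and the helper name solve_analogy are assumptions for illustration, not the output or API of any real model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(vectors, a, b, c):
    """Answer "a is to b as c is to ?" using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates,
               key=lambda word: cosine_similarity(vectors[word], target))


# Hypothetical toy embeddings: the first two dimensions loosely encode
# gender and the third encodes royalty. Invented for this example only.
toy_vectors = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [0.0, 1.0, 0.9],
    "apple": [0.5, 0.5, 0.0],
}

answer = solve_analogy(toy_vectors, "man", "king", "woman")
print(answer)
```

On these toy vectors the arithmetic lands exactly on queen, which is the tidy behavior the analogy benchmark rewards. Real text is messier, as the two problems below show.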
Homographs and Inflection
- Homographs. Current word embedding algorithms tend to identify synonyms efficiently. For example, the vectors for the words house and home have a cosine similarity of 0.63, which means that they are alike to some extent. The vectors for like and love would therefore be expected to be similar too. Nevertheless, they show a cosine similarity of just 0.41, which is surprisingly low. That is because the word like is not only a verb but also a preposition, an adverb, and even a noun. In other words, these uses of like are homographs: different words sharing the same spelling. Since a standard model has no way to distinguish between these identical strings, the single vector for like has to cover every context where the word appears, which makes it, in effect, an average over all of its uses. That is why the vector for like is not as close to love as expected. In practice, this can significantly affect the performance of ML systems, posing a potential problem for conversational agents and text classifiers.
Solution: Train word embedding models on text preprocessed with part-of-speech tagging. In the Bitext token+POS model, the two verbs like and love have a cosine similarity of 0.72. Here, a POS-tagging tool distinguishes homographs by separating their different behaviors according to word class.
- Inflection. Another problem for standard word embeddings is word inflection (the alteration of a word to express different grammatical categories). The verbs find and locate, for instance, have a similarity of 0.68, roughly as close as expected. However, when inflected forms of those verbs (the past tense or participle, for example) are compared, found and located show an unexpectedly low similarity of 0.42. That is because some inflected forms appear less frequently than others in certain contexts. As a result, the algorithm has fewer examples of those less common forms in context to learn from, which leads to less similar vectors. A far bigger issue emerges with more highly inflected languages. While most English verbs have at most 5 different forms (e.g., go, goes, went, gone, going), Spanish verbs have over 50 inflected forms and Finnish verbs over 500. No matter how large the training corpus is, there will not be enough examples of the less common forms to help the algorithm generate useful vectors.
Solution: Train word embedding models on text preprocessed with lemmatization. In the Bitext token+lemma+POS model, found_find_VERB and located_locate_VERB have a cosine similarity of 0.72. Here, the Bitext lemmatizer helps alleviate the shortage of sample data by unifying all the different forms of a word under their canonical form (lemma).
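Both solutions come down to rewriting each token before the embedding model sees it. The sketch below shows what that preprocessing could look like. It is not Bitext's implementation: the part-of-speech tags are written by hand, TOY_LEMMAS is a hypothetical five-entry lookup standing in for a real lemmatizer, and the token_POS format is assumed by analogy with the found_find_VERB example above.

```python
from collections import defaultdict

# Hypothetical lemma lookup; a real lemmatizer covers the whole language.
TOY_LEMMAS = {
    "found": "find",
    "finds": "find",
    "finding": "find",
    "located": "locate",
    "locates": "locate",
}


def token_pos(token, pos):
    """Build a token_POS string so that homographs get separate entries."""
    return f"{token}_{pos}"


def token_lemma_pos(token, pos):
    """Build a token_lemma_POS string; unknown tokens act as their own lemma."""
    lemma = TOY_LEMMAS.get(token, token)
    return f"{token}_{lemma}_{pos}"


# Hand-tagged sentences; in practice the tags come from a POS tagger.
tagged_sentences = [
    [("I", "PRON"), ("like", "VERB"), ("pizza", "NOUN")],
    [("It", "PRON"), ("tastes", "VERB"), ("like", "ADP"), ("chicken", "NOUN")],
]
pos_corpus = [[token_pos(token, pos) for token, pos in sentence]
              for sentence in tagged_sentences]

# The verb "like" and the preposition "like" are now different vocabulary items.
print(pos_corpus)

# Inflected verb forms, annotated with their lemma and part of speech.
verb_forms = ["found", "finds", "finding", "located", "locates"]
lemma_tokens = [token_lemma_pos(form, "VERB") for form in verb_forms]
print(lemma_tokens)

# Grouping by lemma shows how rare forms can share evidence with common ones.
forms_by_lemma = defaultdict(list)
for form in verb_forms:
    forms_by_lemma[TOY_LEMMAS[form]].append(form)
print(dict(forms_by_lemma))
```

The annotated sentences can then be passed to any embedding trainer in place of the raw tokens. How the trained model makes use of the lemma part of each token is not described in this post, so the grouping step is only meant to show why lemmas help with sparse forms.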
These are not the only linguistic challenges word embeddings must face. Stay tuned: our next post will cover more challenges and solutions for word embeddings.