While a lot of research has been devoted to analyzing the theoretical basis of word embeddings, not as much effort has gone towards examining the limitations of using them in production environments.
(This article is the first of a series about word embeddings as the basis for user-facing text analysis applications.)
Quick intro. Word embeddings are essentially a way to convert text into numbers so that ML engines can work with text input. They map a large, sparse one-hot vector space to a lower-dimensional, denser vector space.
This vector space is generated by applying the ideas of distributional semantics, namely, that words that appear in similar contexts have similar behavior and meaning, and can be represented by similar vectors.
As a result, vectors are a very useful representation when it comes to feeding text to ML algorithms, since they allow the models to generalize much more easily.
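The mapping from a one-hot space to a dense space can be pictured as a table lookup. The sketch below is only an illustration: the four-word vocabulary, the 2-dimensional vectors and the helper names `one_hot` and `embed` are invented for this example, and real models learn vectors with hundreds of dimensions.

```python
# Toy vocabulary; real vocabularies contain hundreds of thousands of entries.
vocab = ["house", "home", "car", "milk"]
index = {word: i for i, word in enumerate(vocab)}

# Hypothetical embedding matrix: one dense row per word (values are invented).
embedding_matrix = [
    [0.8, 0.1],  # house
    [0.7, 0.2],  # home
    [0.1, 0.9],  # car
    [0.3, 0.4],  # milk
]


def one_hot(word):
    """Sparse representation: all zeros except a 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


def embed(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    oh = one_hot(word)
    dims = len(embedding_matrix[0])
    return [
        sum(oh[i] * embedding_matrix[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


dense_car = embed("car")
```

Because the one-hot vector has a single non-zero entry, the multiplication simply selects the row for that word, which is why embedding layers are usually implemented as a direct lookup.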
Although these techniques had been available for some years, they were computationally expensive.
It was the appearance of word2vec in 2013 that led to the widespread adoption of word embeddings for ML, since it introduced a way of generating word embeddings in an efficient and unsupervised manner – at least initially, it only requires large volumes of text, which can be readily obtained from various sources.
Accuracy. The “quality” of a word embedding model is often measured by its performance on word analogy problems: for example, the closest vector to king − man + woman should be queen.
While these analogies are nice showcases of the ideas of distributional semantics, they are not necessarily good indicators of how word embeddings will perform in practical contexts.
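As a rough illustration of how such an analogy test works, the sketch below uses hand-made 2-dimensional vectors. The values, the `toy` dictionary and the `analogy` helper are assumptions made up for this example; they are not taken from word2vec, GloVe or fastText.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy embeddings: axis 0 loosely encodes "royalty", axis 1 "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman", toy)
```

Excluding the three input words from the candidates is a common convention in analogy benchmarks; without it, the nearest vector is frequently one of the inputs.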
A lot of research has been devoted to analyzing the theoretical underpinnings of word2vec, as well as similar algorithms such as Stanford’s GloVe and Facebook’s fastText.
However, surprisingly little has been done towards examining the accuracy of using word embeddings in production environments.
Let’s examine some accuracy issues using the English pre-trained vectors from Facebook’s fastText. For that, we will compare words and their vectors using cosine similarity, which is the cosine of the angle between two vectors.
In practice, this value ranges from about 0.25 for completely unrelated words to 0.75 for very similar ones.
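For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the function name and the toy vectors are chosen for this example and are not real fastText vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why the measure compares direction rather than magnitude.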
Problem 1. Homographs & POS Tagging
Current word embedding algorithms tend to identify synonyms quite well. For example, the vectors for house and home have a cosine similarity of 0.63, which indicates they are quite similar, whereas the vectors for house and car have a cosine similarity of 0.43.
We would expect the vectors for like and love to be similar too. However, they only have a cosine similarity of 0.41, which is surprisingly low.
The reason for this is that the token like represents several different words: the verb like (the one we expected to be similar to love) and the preposition like, as well as like used as an adverb, a conjunction, and so on.
In other words, they are homographs – different words with different behaviors but with the same written form.
Without a way to distinguish between verb and preposition, the vector for like captures the contexts of both, resulting in an average of what the vectors for the two words would be, and is therefore not as close to the vector for love as we would expect.
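The averaging effect can be illustrated with toy numbers. In the sketch below, the three vectors and the equal weighting of the two senses are assumptions invented for this example. Real training does not literally average sense vectors, but the resulting single vector behaves in a similar way.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented 3-d vectors for the two senses of "like" and for "love".
like_verb = [0.9, 0.1, 0.1]
like_prep = [0.1, 0.9, 0.2]
love = [0.8, 0.2, 0.1]

# A single token "like" has to cover both senses; model that as their mean.
like_blended = [(a + b) / 2 for a, b in zip(like_verb, like_prep)]

sim_verb = cosine(like_verb, love)        # sense-specific vector vs. love
sim_blended = cosine(like_blended, love)  # blended vector vs. love
```

With these numbers the blended vector is noticeably less similar to love than the verb-only vector, mirroring the drop described above.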
In practice, this can significantly impact the performance of ML systems such as conversational agents or text classifiers.
For example, if we are training a chatbot/assistant, we would expect the vectors for like and love to be similar, so queries like I like fat free milk and I love fat free milk are treated as semantically equivalent.
How can we get around this problem?
The easiest way is to train word embedding models using text that has been preprocessed with POS (part-of-speech) tagging. In short, POS tagging allows us to distinguish between homographs by isolating their different behaviors.
At Bitext we produce word embedding models with token+POS, rather than only with token, as in Facebook’s fastText; as a result, like|VERB and love|VERB have a cosine similarity of 0.72.
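A minimal sketch of the token+POS idea is shown below. It is not Bitext's actual pipeline: the `attach_pos` helper is hypothetical, the sentences are tagged by hand with Universal POS-style labels, and a real system would obtain the tags from a POS tagger before training.

```python
def attach_pos(tagged_sentence, sep="|"):
    """Turn (token, POS) pairs into 'token|POS' strings for embedding training."""
    return [f"{token.lower()}{sep}{pos}" for token, pos in tagged_sentence]


# Hand-tagged examples; the tags are assumptions, not the output of a tagger.
sent_verb = [("I", "PRON"), ("like", "VERB"), ("fat", "ADJ"),
             ("free", "ADJ"), ("milk", "NOUN")]
sent_prep = [("It", "PRON"), ("tastes", "VERB"), ("like", "ADP"),
             ("milk", "NOUN")]

# Each sentence becomes a list of token|POS units, ready for an embedding trainer.
corpus = [attach_pos(sent_verb), attach_pos(sent_prep)]
vocabulary = sorted({unit for sentence in corpus for unit in sentence})
```

After this step the verb and the preposition are separate vocabulary entries (like|VERB and like|ADP), so each one gets its own vector during training.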
We produce these models now (Q4 2021) in 14 languages (English, Spanish, German, French, Italian, Portuguese, Dutch, etc.), and new ones are in the pipeline.
By Daniel Benito, Bitext USA; & Antonio Valderrabanos, Bitext EU
(New articles will follow on other language phenomena that negatively impact the quality of word embeddings.)