Using Word2Vec for ontology creation

An ontology is a data structure that groups entities into domains or types (for example, the entities ‘dog’ and ‘cat’ are grouped under the type ‘animals’) and establishes relations between those entities. Its uses in computational linguistics are vast; one of the most interesting for us is the application of ontologies to chatbot training. When two humans communicate, they share a knowledge of the world that they take for granted in any spoken interaction. A chatbot, however, lacks this indispensable knowledge. An ontology can help the chatbot discern that a person can walk a dog, but that a dog walking a person is not something that happens in our world.

At Bitext, we’re using different technologies for ontology creation, such as our entity analysis extractor and our phrase extractor, both available through our API. We have also tested Word2Vec, a model commonly used in the machine learning field, to populate an ontology and to identify relations between the different classes of that ontology automatically.

Word2Vec is a group of statistical models that have been quite successful at the task of meaning representation, especially if we take into account how complex and important that task is for NLP. It’s based on distributional semantics (Harris, 1954); the idea, in simple terms, is that lexical units that have similar meanings should appear in similar contexts (by context I mean the other words that appear before and after the word we’re interested in). Makes sense, right? The only question is how we work all of that out computationally.

Word2Vec libraries do the following: using (preferably) large quantities of data, they produce a high-dimensional space in which every lexical unit occupies a point. In other words, they transform words into vectors, hence the name. Words that have similar distributions will have similar vectors, and therefore similar meanings, at least as far as computers are concerned.
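To make the distributional idea concrete, here is a minimal sketch that turns words into vectors simply by counting their neighbours. This is not what Word2Vec does internally (Word2Vec learns dense vectors with a small neural network); it is only a count-based illustration of the same intuition, and the toy corpus, the `context_vectors` helper and the window size are all made up for the example.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; each sentence is already tokenized.
corpus = [
    "i walk my dog in the park".split(),
    "i walk my cat in the park".split(),
    "i feed my dog every morning".split(),
    "i feed my cat every morning".split(),
    "i park my car in the garage".split(),
]


def context_vectors(sentences, window=2):
    """Count, for every word, the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


vectors = context_vectors(corpus)
print("dog:", dict(vectors["dog"]))
print("cat:", dict(vectors["cat"]))
print("car:", dict(vectors["car"]))
```

In this corpus ‘dog’ and ‘cat’ always appear in the same contexts, so their count vectors come out identical, while ‘car’ gets a different one. A real model works with far more data and far less tidy distributions, but the principle is the same.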

Gensim is a Python library that takes care of all that. It’s easy to use and, when used well, generates fairly satisfactory results. The only thing it needs is data. Lots of it. Depending on how we’re planning on using our model, we need to be more or less picky about the quality of the corpus we’re using, but when in doubt the general rule of thumb is: the more data we have, the better.

It is advisable to preprocess the corpus, though. Gensim can tokenize texts on its own, and can even take n-grams into account, but that basic tokenization splits on whitespace. As a result, Gensim would interpret potato and potato, (with a comma) as two different words. The same goes for lower and upper case. It wouldn’t ruin your results, but it is a nuisance. That is why, for users who know that even the smallest detail matters when dealing with text, we offer enterprise-grade tokenizers and lemmatizers that can be run on premise or through our soon-to-be-launched NLP framework.
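As a rough illustration of the kind of preprocessing meant here, the sketch below lowercases the text and strips punctuation before the tokens ever reach the model. It uses only the standard library; the `simple_tokenize` helper and its regular expression are assumptions made for this example, not Bitext’s tokenizer, and a real pipeline would need to handle many more cases (accents, hyphens, typographic apostrophes and so on).

```python
import re

# Runs of lowercase letters, digits or straight apostrophes count as one token;
# everything else (spaces, commas, full stops...) acts as a separator.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def simple_tokenize(text):
    """Lowercase the text and return its tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = "Potato, potato and POTATO."

naive = sentence.split()             # whitespace split keeps case and punctuation
cleaned = simple_tokenize(sentence)  # lowercase tokens, punctuation removed

print(naive)
print(cleaned)
print(len(set(naive)), "distinct tokens before,", len(set(cleaned)), "after")
```

With a whitespace split the three potatoes end up as three different vocabulary entries; after the clean-up they collapse into one. Gensim also ships a helper in the same spirit, `gensim.utils.simple_preprocess`, which is worth a look before writing your own.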

Furthermore, depending on the size of the corpus, training the model could take several hours or even days, but luckily you can save the model to disk. That way you don’t have to carry out the computationally arduous task of training the model every time you need to use it.

So now that we have our model, what do we do with it? Gensim incorporates some pretty handy methods: we can determine the similarity between two words, obtain the top N words most similar to any given word, and so on. When you think about it, if we have a vectorial representation of meaning, the sky’s pretty much the limit. You can even add and subtract meaning: in models trained on large corpora, taking the vector for ‘king’, subtracting ‘man’ and adding ‘woman’ tends to land close to ‘queen’.



Similarity is determined by spatial proximity using a cosine function. Basically, the closer the value is to 1, the more similar the words are, and the closer it is to -1, the less similar.
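For readers who want to see what that function actually computes, here is a small standard-library sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The two-dimensional vectors are made up so the behaviour is easy to follow by eye; real word vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional vectors, chosen to show the three typical cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0: unrelated
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```

Note that only the direction of the vectors matters, not their length: [1, 2] and [2, 4] count as maximally similar.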


Thus, Word2Vec is a useful tool for our purpose of populating an ontology, and with a little tweaking and additional linguistic preprocessing, such as lemmatization and POS tagging, it can help us identify relations between classes of this same ontology as well.
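One simple way to picture populating an ontology with word vectors is to attach each candidate word to the type whose vector it is most similar to. The sketch below is a hypothetical illustration of that idea, not a description of our pipeline: the three-dimensional vectors are written by hand to stand in for trained embeddings, and the `populate` helper, the type names and the similarity threshold are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hand-made vectors standing in for trained embeddings (illustrative only).
embeddings = {
    "animal": [0.9, 0.1, 0.0],
    "vehicle": [0.0, 0.2, 0.9],
    "dog": [0.8, 0.3, 0.1],
    "cat": [0.7, 0.2, 0.0],
    "truck": [0.1, 0.1, 0.8],
    "bicycle": [0.2, 0.3, 0.7],
}

ontology_types = ["animal", "vehicle"]          # hypothetical ontology types
candidates = ["dog", "cat", "truck", "bicycle"]  # words to place in the ontology


def populate(types, words, vectors, threshold=0.5):
    """Attach each word to its most similar type, skipping weak matches."""
    ontology = {type_name: [] for type_name in types}
    for word in words:
        best = max(types, key=lambda t: cosine(vectors[word], vectors[t]))
        if cosine(vectors[word], vectors[best]) >= threshold:
            ontology[best].append(word)
    return ontology


ontology = populate(ontology_types, candidates, embeddings)
print(ontology)
```

With these toy vectors, ‘dog’ and ‘cat’ end up under ‘animal’ and ‘truck’ and ‘bicycle’ under ‘vehicle’. With real embeddings the threshold matters a lot more, since a word that is not clearly close to any type is better left unassigned than forced into the wrong one.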

We are working on running the same experiment using linguistically meaningful groups of words, or phrases, using our Phrase extraction API.

The use of NLP tools accelerates and improves the creation of ontologies, helping AI and chatbots understand humans.


