An ontology is a data structure that groups entities into domains or types (for example, the entities ‘dog’ and ‘cat’ are grouped under the type ‘animal’) and establishes relations between those entities. Its uses in Computational Linguistics are vast; one of the most interesting for us is the application of ontologies to chatbot training. When two humans communicate, they rely on a shared knowledge of the world that they presume in any spoken interaction. A chatbot, however, lacks this indispensable knowledge. An ontology can help the chatbot discern that a person can walk a dog, but that a dog walking a person is not something possible in our world.
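To make the idea concrete, here is a minimal sketch of an ontology as a plain data structure, with types that group entities and relations that constrain what can act on what. All names are illustrative, not part of any Bitext product:

```python
# Minimal sketch of an ontology: types group entities, and
# (subject_type, verb, object_type) triples define allowed relations.

ontology = {
    "types": {
        "animal": {"dog", "cat"},
        "person": {"owner"},
    },
    "relations": {
        # a person can walk an animal, but not the other way around
        ("person", "walk", "animal"),
    },
}

def type_of(entity):
    """Return the type an entity belongs to, or None if unknown."""
    for type_name, members in ontology["types"].items():
        if entity in members:
            return type_name
    return None

def is_plausible(subject, verb, obj):
    """Check whether a subject-verb-object triple matches an allowed relation."""
    return (type_of(subject), verb, type_of(obj)) in ontology["relations"]

print(is_plausible("owner", "walk", "dog"))  # True: a person can walk a dog
print(is_plausible("dog", "walk", "owner"))  # False: the reverse is ruled out
```

A chatbot equipped with this kind of world knowledge can reject interpretations that violate the allowed relations.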
At Bitext, we’re using different technologies for ontology creation, such as our entity analysis extractor and our phrase extractor, both available through our API. We have also tested Word2Vec, a model commonly used in the Machine Learning field, to populate an ontology and automatically identify relations between the different classes of that ontology.
Word2Vec is a group of statistical models that have been quite successful at the task of meaning representation, especially considering the complexity of that task and its importance for NLP. It’s based on distributional semantics (Harris, 1954); the idea, in simple terms, is that lexical units with similar meanings should appear in similar contexts (by context I mean the words that appear before and after the word we’re interested in). Makes sense, right? The only question is: how do we compute all of that?
Word2Vec libraries do the following: using (preferably) large quantities of data, they produce high-dimensional spaces in which every lexical unit occupies a point. In other words, they transform words into vectors, hence the name. Words with similar distributions will have similar vectors, and therefore similar meanings, at least as far as computers are concerned.
Gensim is a Python library for vector space and topic modelling that includes a widely used implementation of Word2Vec. Training a model takes little more than feeding it a corpus of tokenized sentences.
Furthermore, depending on the size of the corpus, training the model could take several hours or even days, but luckily you can save the model to disk. That way you don’t have to carry out the computationally arduous task of training the model every time you need to use it.
So now that we have our model, what do we do with it? Gensim incorporates some pretty handy methods: we can measure the similarity between two words, obtain the top n words most similar to any given word, and so on. When you think about it, if we have a vectorial representation of meaning, the sky’s pretty much the limit. You can even add and subtract meaning. Here’s an example of what I’m talking about:
Thus, Word2Vec is a useful tool for our purpose of populating an ontology, and with a little tweaking and additional linguistic preprocessing, like lemmatization and POS tagging, it can help us identify relations between classes of that same ontology as well.
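As a sketch of what that preprocessing might look like, each token can be lemmatized and marked with its POS tag before training, so that “walks” and “walked” collapse into one unit while noun and verb readings stay distinct. The lemma table here is hand-made for illustration; a real pipeline would use a proper lemmatizer and tagger:

```python
# Sketch: turn (token, POS) pairs into "lemma_POS" training units.
# LEMMAS is a tiny hand-made table, purely for illustration.
LEMMAS = {"walks": "walk", "walked": "walk", "dogs": "dog"}

def preprocess(tagged_sentence):
    """Map (token, POS) pairs to 'lemma_POS' strings for Word2Vec training."""
    return [f"{LEMMAS.get(token.lower(), token.lower())}_{pos}"
            for token, pos in tagged_sentence]

tagged = [("The", "DET"), ("owner", "NOUN"), ("walks", "VERB"),
          ("the", "DET"), ("dogs", "NOUN")]
print(preprocess(tagged))
# ['the_DET', 'owner_NOUN', 'walk_VERB', 'the_DET', 'dog_NOUN']
```

Feeding units like `walk_VERB` rather than raw tokens to Word2Vec gives cleaner vectors for relation discovery between ontology classes.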
We are working on running the same experiment with linguistically meaningful groups of words, or phrases, obtained through our Phrase extraction API.
The use of NLP tools accelerates and improves the creation of ontologies, helping AI and chatbots understand humans.