Artificial Training Data for Chatbots

When working on AI projects, having your own data to feed your solution is key to good performance. Gathering e-mails and conversation logs to train your bot is a makeshift solution at best, but the lack of data can now be tackled at the root. Why not start farming your own data instead of harvesting it?

Building effective customer support agents requires large amounts of data so that the bot can understand every query a user makes. However, obtaining and manually tagging example utterances for AI training is expensive, time-consuming and error-prone:

  • On the one hand, smaller companies are stuck trying to come up with examples of the various ways in which users can express the intents the bot supports.
  • On the other hand, even large companies with extensive customer support chat logs must manually tag the unstructured data so that it can be used for AI purposes.

Both are slow processes that will probably lead to inconsistencies and overall poor NLU performance.

Too often, companies also get a simple bot up and running hoping that users’ interactions will produce enough logs to improve and augment the training data. This approach is risky since a bot performing poorly may drive users away, and the resulting low engagement means that not enough data is collected.

We propose an entirely different approach: generating artificial training data.

What Is Artificial Training Data?

Artificial training data, also called synthetic data, is not a brand-new idea. It has been used in various Machine Learning (ML) fields, including computer vision and especially self-driving cars, either to augment existing data by transforming images (mirroring, darkening, etc.) or to generate completely new data, such as adapting driving simulation games to act as training environments for self-driving cars.
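
As a rough illustration of augmentation by transformation, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows of 0-255 pixel values; the function names mirror, darken and augment are ours, chosen for illustration, and a real pipeline would more likely rely on an image library.

```python
def mirror(image):
    """Flip a grayscale image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def darken(image, factor=0.5):
    """Scale every pixel intensity by `factor` (between 0 and 1)."""
    return [[int(pixel * factor) for pixel in row] for row in image]


def augment(image):
    """Return the original image plus three transformed variants."""
    flipped = mirror(image)
    return [image, flipped, darken(image), darken(flipped)]


# Hypothetical 2x3 image, used only to exercise the functions.
image = [[0, 128, 255], [64, 192, 32]]
variants = augment(image)
```

One labeled image becomes four training examples, and the label (for example, "pedestrian") carries over unchanged to every variant.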

However, its usefulness is limited by how well we can model the data we are trying to generate. For example, synthetic data is used extensively in physics computer simulations, where the ‘rules’ are well known. At the same time, advances are being made in training GANs (generative adversarial networks), where one network generates data and another tries to detect ‘fake’ data; that feedback is used to optimize the generator until its synthetic data is indistinguishable from real data.


Artificial Data for Chatbots

As in physics, the rules that govern language are well known: humans have been studying language for hundreds of years. As seen in our previous post, artificial training data helps automate your bot’s training phase.

In the AI field, you can use ontologies or knowledge graphs to model a specific domain (for example, retail), describing the relevant objects, actions and modifiers and the ways in which they relate to one another. Using linguistics, you can then define structures for the various ways in which these concepts can be expressed in language, covering variations in morphology, syntax, synonyms, levels of politeness, and commands versus questions. Because it is produced from these explicit rules, the resulting ‘generated data’ is correct, fully tagged, consistent and customizable (e.g. for specific sub-domains).
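
To make the idea concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the tiny retail ontology, the intent names and the sentence templates are all hypothetical, and a real system would model morphology and syntax far more carefully.

```python
from itertools import product

# Hypothetical mini-ontology for a retail domain: each intent pairs an
# action with the objects it applies to, listing synonyms for both.
ONTOLOGY = {
    "order.cancel": {
        "action": ["cancel", "call off"],
        "object": ["order", "purchase"],
    },
    "order.track": {
        "action": ["track", "locate"],
        "object": ["order", "package"],
    },
}

# Hypothetical linguistic structures: a command, two polite requests
# and a question.
TEMPLATES = [
    "{action} my {object}",
    "please {action} my {object}",
    "I would like to {action} my {object}",
    "can you {action} my {object}?",
]


def generate(ontology, templates):
    """Yield one tagged example per (intent, template, action, object)."""
    for intent, slots in ontology.items():
        combos = product(templates, slots["action"], slots["object"])
        for template, action, obj in combos:
            yield {
                "text": template.format(action=action, object=obj),
                "intent": intent,
                "entities": {"action": action, "object": obj},
            }


examples = list(generate(ONTOLOGY, TEMPLATES))
```

Every example comes out already tagged with its intent and entities, so no manual annotation step is needed, and adding one synonym or one template multiplies the number of utterances generated.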

Generating data for a new vertical only requires building a new ontology, a step that can be largely automated using various NLP tools. Results can be incrementally improved to handle even implicit requests (e.g. ‘I forgot my password’ should be interpreted as a request to reset the user’s password) as they are encountered.
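
One simple way to picture this incremental improvement is sketched below in Python. It assumes training examples are stored as dictionaries with "text" and "intent" keys; the intent name password.reset, the helper add_implied and the second phrasing are hypothetical, introduced only for illustration.

```python
# Hypothetical starting point: examples produced by the generator.
generated = [
    {"text": "reset my password", "intent": "password.reset"},
]

# Implicit phrasings mapped to the intent they really express,
# extended by hand as new cases are encountered.
IMPLIED = {
    "password.reset": [
        "I forgot my password",
        "I can't remember my password",
    ],
}


def add_implied(examples, implied):
    """Return a new list with implied phrasings appended, skipping duplicates."""
    seen = {(e["text"].lower(), e["intent"]) for e in examples}
    merged = list(examples)
    for intent, phrases in implied.items():
        for phrase in phrases:
            key = (phrase.lower(), intent)
            if key not in seen:
                seen.add(key)
                merged.append({"text": phrase, "intent": intent, "implied": True})
    return merged


training_set = add_implied(generated, IMPLIED)
```

Because duplicates are skipped, the same mapping can be re-applied safely every time a new implicit request is added.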

While AI algorithms have become a commodity, the useful data needed to train them is still scarce. Smaller companies lack the resources or access to the large volumes of training data required to train high-quality models. Artificial (synthetic) training data generation is therefore the answer to ‘democratizing’ the field, and it delivers the best results when the data can be modeled using well-known rules (such as physics or language).
