Evaluate the Quality of your Chatbots and Conversational Agents

Chatbots require large amounts of training data to perform correctly. If you want your chatbot to recognize a specific intent, you need to provide a large number of sentences that express that intent, usually written by hand. This manual generation is time-consuming and error-prone, and can lead to inaccurate results. How can we solve this problem? With artificially generated data.
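As a toy illustration of the idea (not Bitext's actual generation technology, which is linguistically far richer), one can expand a handful of hand-written templates into many training variants by combining interchangeable phrases for each slot:

```python
from itertools import product

# Hypothetical templates and slot fillers for a "turn on the lights" intent.
templates = [
    "{verb} the {location} {device}",
    "please {verb} the {device} in the {location}",
]
slots = {
    "verb": ["turn on", "switch on", "activate"],
    "location": ["kitchen", "bedroom", "living room"],
    "device": ["light", "lamp"],
}

def generate_variants(templates, slots):
    """Expand every template with every combination of slot fillers."""
    variants = []
    for tpl in templates:
        for verb, location, device in product(
            slots["verb"], slots["location"], slots["device"]
        ):
            variants.append(tpl.format(verb=verb, location=location, device=device))
    return variants

variants = generate_variants(templates, slots)
print(len(variants))  # 2 templates x 3 verbs x 3 locations x 2 devices = 36
```

Even this naive combinatorial approach turns 2 seed patterns into 36 training sentences; a production service additionally handles grammatical agreement, register, and genuine paraphrasing rather than simple slot substitution.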

Since Dialogflow is one of the most popular chatbot-building platforms, we chose to perform our tests using it. We tested how Dialogflow can benefit from the Artificial Training Data approach, comparing bots trained on hand-tagged sentences with bots trained on automatically generated data. Our tests show that if we train bots in Dialogflow with only 2 or 3 example sentences per intent, performance suffers, and increasing to 10 sentences per intent yields only minimal improvement. By extending these hand-tagged corpora with additional variants generated automatically by Artificial Training Data, however, accuracy improves substantially.

We carried out two different tests (A and B), both using the same 5 intents related to house lighting management. In the first test (A), we trained two different bots:

  • A first bot (A1) was trained with only 12 hand-tagged sentences (2 to 3 sentences per intent).
  • Using those 12 sentences as input, our Bitext Artificial Training Data service generated 391 additional sentences which, combined with the original 12, were used to train a second bot (A2) with around 80 sentences per intent.

The second test (B) was very similar to the first; the only difference was the number of sentences used in the training set. In this case, the first bot (B1) was trained with a hand-tagged training set of 50 sentences (10 per intent). Using those sentences as input, our Bitext Artificial Training Data service generated 798 sentences which, combined with the original 50, were used to train the second bot (B2) with around 170 sentences per intent. Both tests used the same evaluation set of 100 sentences.
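The scoring procedure behind these comparisons can be sketched simply: each evaluation sentence is run through the bot's intent detector and the prediction is compared with the hand-tagged label. In the sketch below, `detect_intent` is a hypothetical stand-in for a real NLU call (e.g. Dialogflow's detectIntent endpoint):

```python
def detect_intent(sentence):
    """Stand-in keyword classifier; a real bot would query the NLU service."""
    if "on" in sentence.split():
        return "lights_on"
    return "lights_off"

# Tiny hypothetical evaluation set of (sentence, gold-label) pairs;
# the real benchmark uses 100 hand-tagged sentences across 5 intents.
evaluation_set = [
    ("turn on the kitchen light", "lights_on"),
    ("switch off the bedroom lamp", "lights_off"),
]

def intent_accuracy(evaluation_set):
    """Fraction of evaluation sentences whose predicted intent matches the label."""
    correct = sum(
        1 for text, gold in evaluation_set if detect_intent(text) == gold
    )
    return correct / len(evaluation_set)

print(intent_accuracy(evaluation_set))
```

Slot filling is scored the same way, except that the extracted slot values (e.g. location and device) are compared against the tagged ones instead of the intent label.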

In both tests, we observed a significant improvement, reaching at least 90% accuracy in both intent detection and slot filling. Do you want to see the results for yourself? Download our Dialogflow Full Benchmark Dataset now.

The Bitext Artificial Training Data service lets you create large training sets with minimal effort. If you write only one or two sentences per intent, our service can generate the rest of the variants needed to go from poor results to high accuracy.

