How we trained our chatbot with one billion sentences

A chatbot trained with 1 billion user requests sounds like science fiction, particularly in a field like AI, where scarcity of training data is a widespread problem. Bots need to be fed data that:

• Comes in vast amounts. It’s no secret that the concept of ‘big’ has shifted in the past few years: what counted as big data yesterday is a small dataset today. The more data you can collect, the more confident your bot’s intent detection is going to be.

• Is high-quality. For text data, this means tagged corpora, and tagging by hand is a dull and not very reliable task. But the quality of the data is key to a successful conversation.

• Can be used. Bot designers face legal issues with user data privacy. Most of it must be anonymized or discarded, which only contributes to the data scarcity problem.

So how do you obtain this kind of text to train Geoffrey, our chatbot? We’ve calculated how many utterances a chatbot designer would need to build a powerful and competent conversational bot.

How did we create Geoffrey?

Usually, a chatbot like this will be specialized in a certain field. Bitext has built Domobot, a chatbot for IoT and home automation, with around 100 actions, 300 devices, 100 places and 20 features. We combined them and got all the possible sentences users can say to express their intents. This gave us a first approximation of 10,000,000 base sentences, which translates to the astonishing number of 1,000,000,000 expanded sentences or utterances.
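As a quick sanity check on these figures, here is a small sketch of the arithmetic. The entity counts and the two sentence totals come from the paragraph above; the variable names are ours. The plain product of the four counts is only an upper bound on the number of combinations, and the quoted base figure sits well below it.

```python
# Entity counts quoted above for the home-automation bot.
counts = {"actions": 100, "devices": 300, "places": 100, "features": 20}

# Naive upper bound: every action paired with every device, place and feature.
naive_combinations = 1
for n in counts.values():
    naive_combinations *= n

base_sentences = 10_000_000           # figure quoted above
expanded_utterances = 1_000_000_000   # figure quoted above

# Average number of utterances each base sentence expands into.
variants_per_base = expanded_utterances // base_sentences

print(f"naive combinations: {naive_combinations:,}")          # 60,000,000
print(f"variants per base sentence: {variants_per_base}")     # 100
```

In other words, each base sentence stands for roughly 100 different ways of saying the same thing.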

Now that you know how many utterances you need to start having a competent conversational bot, what are you going to do? Are you aware of the time required to write down and tag them all?

I know what you are thinking: you can’t just multiply all the entities because not everything can go with everything. Let’s look at a simple example: imagine you have the entity ‘Devices’, where you have ‘TV’, ‘hi-fi’, ‘computer’, ‘alarm’, etc., and another one with ‘Actions’ that can be performed in the house (‘turn on’, ‘set’, ‘open’…). You can ‘turn on the computer’, you can ‘set the alarm’, but you can’t ‘open the hi-fi’, or ‘set the TV’. These last two utterances don’t make sense and would add noise to the chatbot training process.

The only solution to this headache is to apply linguistic knowledge. Defining relations among the entities ensures a lean and sensible corpus of utterances that are relevant for our chatbot.
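To make the idea concrete, here is a minimal sketch of generating sentences from a compatibility relation instead of a blind product. It is not Bitext's implementation. The `compatible` table is a hypothetical stand-in for real linguistic knowledge: only ‘turn on the computer’ and ‘set the alarm’ (valid) and ‘open the hi-fi’ and ‘set the TV’ (invalid) come from the example above, and the remaining pairs are our assumptions.

```python
from itertools import product

actions = ["turn on", "set", "open"]
devices = ["TV", "hi-fi", "computer", "alarm"]

# Hypothetical relation: the devices each action can sensibly apply to.
# In this small device list, nothing can be opened.
compatible = {
    "turn on": {"TV", "hi-fi", "computer"},
    "set": {"alarm"},
    "open": set(),
}

def generate(actions, devices, compatible):
    """Return only the action/device sentences allowed by the relation."""
    return [
        f"{action} the {device}"
        for action, device in product(actions, devices)
        if device in compatible.get(action, set())
    ]

sentences = generate(actions, devices, compatible)
print(len(actions) * len(devices))  # 12 blind combinations
print(len(sentences))               # 4 sensible ones
print(sentences)
```

Even in this toy case, two thirds of the blind combinations are noise that never reaches the training set.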

Besides, there are many variants of the same base sentence, and producing them is, again, not a matter of iteration, but a task for a team of linguists: you have to cover phenomena like politeness, redundancy or coordinate propositions, just to mention some of them. That’s why the numbers above won’t make any sense if you try to do a plain multiplication.
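As a rough illustration of how variants multiply, the sketch below expands two base sentences with a handful of politeness templates and simple coordination. The templates and function names are hypothetical, picked only for this example. A template loop like this shows why the counts grow so fast, but it is no substitute for the linguistic rules described above.

```python
from itertools import permutations

# Hypothetical politeness templates; "{}" is replaced by the sentence.
POLITENESS = ["{}", "please {}", "could you {}", "{} please"]

def politeness_variants(sentence):
    """Wrap one sentence in every politeness template."""
    return [template.format(sentence) for template in POLITENESS]

def coordinated(bases):
    """Join every ordered pair of distinct base sentences with 'and'."""
    return [f"{first} and {second}" for first, second in permutations(bases, 2)]

bases = ["turn on the computer", "set the alarm"]

utterances = []
for sentence in bases + coordinated(bases):
    utterances.extend(politeness_variants(sentence))

print(len(utterances))  # 16 utterances from 2 base sentences
```

Two base sentences already yield 16 utterances with just two phenomena covered.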

So, now that you understand these figures and that training a bot isn’t as simple a task as some might say, don’t lose hope just yet. Bitext NLP middleware for bots automatically generates tagged utterances that are similar, sensible expressions of the user requests you already have. This way you can count, from the first minute, on a sizeable dataset that is high-quality and ready to use.

If you want to try our bot and see how the technology described above works, click here!

Try Geoffrey  

