Training machines to understand language the way humans do is a difficult task, but not an impossible one. To achieve it, the data fed to the bot has to be organized, tagged, or flagged so that the machine can understand and process both what the human is trying to say and the context behind it.
At Bitext, our text generation system produces training datasets that simulate a wide range of natural text (text close to the way we actually speak and write). It not only creates multiple variants of a specific utterance, but also captures the diverse ways an intent can be expressed in real life. Additionally, each utterance carries flags that mark the specific features it covers, which allows us to tailor the dataset to the way a particular demographic writes or communicates.
To provide some examples, utterances can be:
- neutral (“reset my password”)
- polite (“would you be so kind as to reset my password, please?”)
- colloquial (“I wanna reset my pw”)
- with misspellings (“recvoer my password”)
- with synonyms (“I want to reset my pw”)
- with regional features (“show me the nearest ABM”, the Canadian variant of “ATM”)
- offensive language (“I want my f***ng password”)
- code-switching (“recover my password, s’il vous plait”)
- “keyword-style” language (“pw”)
- and many other features, such as anonymization, tokenization or spell checking
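To illustrate how such flags can be put to work, here is a minimal sketch in Python. The field names (`intent`, `text`, `flags`) and the filtering helper are assumptions for illustration, not Bitext's actual schema; the idea is simply that flagged utterances let you filter a dataset down to the language styles a given demographic actually uses.

```python
# Hypothetical flagged training utterances; the schema below is an
# illustrative assumption, not Bitext's real data format.
dataset = [
    {"intent": "reset_password", "text": "reset my password",
     "flags": []},  # neutral
    {"intent": "reset_password",
     "text": "would you be so kind as to reset my password, please?",
     "flags": ["polite"]},
    {"intent": "reset_password", "text": "I wanna reset my pw",
     "flags": ["colloquial"]},
    {"intent": "reset_password", "text": "recvoer my password",
     "flags": ["misspelling"]},
]

def filter_by_flags(data, allowed):
    """Keep neutral utterances plus those whose flags are all allowed."""
    return [u for u in data if set(u["flags"]) <= set(allowed)]

# Tailor a dataset for a bot that should handle informal, typo-prone input:
informal = filter_by_flags(dataset, {"colloquial", "misspelling"})
```

Because the neutral utterance carries no flags, it survives any filter, while the polite variant is dropped here; swapping the allowed set is all it takes to retarget the same generated data at a different audience.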