Noisy text is realistic text

One flaw of the usual way of generating training data is that, when you ask someone to create it manually, they will make an effort to write the sentences correctly, following the spelling and punctuation norms of your language. Even if some errors slip in, they will be minimal, because the writers are trying to do things right, that is, to provide “orthographically correct” sentences.

Yet the real world shows us that this is not how users actually write. Our chatbots’ logs are full of barely intelligible queries, spelling mistakes, and missing or wrong punctuation. And you can’t force your potential users to adjust to the norms just to be understood by your chatbot, can you? So, what options do you have?

Bitext’s approach has been to analyze the language found in a large number of logs, identify the most common deviations from the norm that appear in them, and (optionally) reproduce them in our training datasets. So, in our Free Retail Dataset we have included a proportion of “noisy” text that makes your training text much more similar to the queries you will actually receive, so your chatbots will understand more of your users’ needs.
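To make the idea concrete, here is a minimal sketch of how noise could be injected into a proportion of clean training sentences. It is not Bitext's actual method: the function names (`add_noise`, `augment`), the four noise operations and the default rates are all assumptions chosen for illustration, whereas a real pipeline would derive its variations from what actually appears in logs.

```python
import random
import string

# Hypothetical noise operations; a real system would base these on log analysis.
def swap_adjacent(word, rng):
    """Transpose two neighbouring characters, a common typing slip."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def drop_char(word, rng):
    """Delete one character (words of length 1 are left alone)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def strip_punctuation(word, rng):
    """Remove ASCII punctuation, as in queries typed without commas or '?'."""
    return word.translate(str.maketrans("", "", string.punctuation))

def lowercase(word, rng):
    """Ignore capitalization."""
    return word.lower()

NOISE_OPS = [swap_adjacent, drop_char, strip_punctuation, lowercase]

def add_noise(sentence, rate=0.3, seed=None):
    """Apply one random noise operation to roughly `rate` of the words."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if rng.random() < rate:
            word = rng.choice(NOISE_OPS)(word, rng)
        noisy.append(word)
    # Drop tokens that became empty (e.g. a lone "?" stripped of punctuation).
    return " ".join(w for w in noisy if w)

def augment(sentences, noisy_fraction=0.2, rate=0.3, seed=0):
    """Return the dataset with about `noisy_fraction` of sentences made noisy."""
    rng = random.Random(seed)
    out = []
    for s in sentences:
        if rng.random() < noisy_fraction:
            out.append(add_noise(s, rate=rate, seed=rng.randrange(2**32)))
        else:
            out.append(s)
    return out
```

Calling `augment(["Where is my order?", "I want to cancel my purchase."], noisy_fraction=0.5)` returns a list of the same length in which some sentences may carry typos, missing punctuation or lowercased words. Keeping the noisy share as a parameter mirrors the optional, proportional use of noise described above.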

Since reality has noise, a noisy text is a realistic text… and we can give you that!


