Data scarcity is one of the major bottlenecks that AI practitioners have to deal with when training production-level models. Obtaining additional data typically involves costly manual annotation processes which, as we described in a previous post, are fraught with problems.
Because of this, one of the most common ways of dealing with data scarcity is transfer learning. At a high level, transfer learning is the process of training an AI model to solve one problem and then reusing the knowledge encoded in that model (usually in the form of internal representations) to solve a related problem, typically by further training the existing model on examples of that related problem.
In the context of NLP, transfer learning has become quite popular. One classic example is the use of pre-trained representations in the form of the word embeddings generated by word2vec, GloVe or fastText. In many cases, NLP models built on top of word embeddings outperform those trained from scratch, usually by a few points. More recently, pre-trained language models (such as BERT), which take into account the context of words, have replaced word embeddings as the starting point when training models for tasks ranging from Named Entity Recognition (NER) to intent detection.
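To make the idea concrete, here is a minimal sketch of this style of transfer learning: a frozen "pre-trained" embedding layer with a small classifier head trained on top of it for a downstream intent-detection task. The vocabulary, labels, and embeddings below are all illustrative stand-ins (the vectors are random rather than genuinely pre-trained), not any specific library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary for a hypothetical intent-detection task.
VOCAB = {"refund": 0, "order": 1, "cancel": 2, "hello": 3}
EMB_DIM = 8

# Stand-in for vectors from word2vec/GloVe/fastText; in this sketch they
# are random, but the key point is that they stay frozen during training.
pretrained_embeddings = rng.normal(size=(len(VOCAB), EMB_DIM))

def featurize(tokens):
    """Average the frozen embeddings of known tokens (a common baseline)."""
    ids = [VOCAB[t] for t in tokens if t in VOCAB]
    return pretrained_embeddings[ids].mean(axis=0)

# Tiny labeled set for the downstream task (1 = billing intent, 0 = other).
data = [(["cancel", "order"], 1), (["refund", "order"], 1), (["hello"], 0)]

# Train only the classifier head (logistic regression via gradient ascent);
# the embedding layer is never updated -- that reuse is the "transfer".
w = np.zeros(EMB_DIM)
b = 0.0
for _ in range(500):
    for tokens, y in data:
        x = featurize(tokens)
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        w += 0.1 * (y - p) * x
        b += 0.1 * (y - p)

def predict(tokens):
    """Return 1 if the utterance looks like a billing intent, else 0."""
    return int((w @ featurize(tokens) + b) > 0)
```

The same pattern scales up to fine-tuning a contextual model such as BERT: the pre-trained layers carry over, and only a small task-specific head (plus, optionally, the upper layers) is trained on the scarce downstream data.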
While these techniques produce small incremental improvements in performance, they are often not accessible to non-technical end users such as, say, someone building a chatbot on a commercial conversational AI platform, where they don't have access to the underlying models.
To address this, at Bitext we are solving the data scarcity problem by having machines generate the training data. Bitext's Synthetic Data Service will create all the training data you need, tailored to your use case, in a much shorter time than you could possibly get via manual work, and in a way that is easily reusable whenever you need it. Visit us and download a test dataset to check whether synthetic/artificial data works for your case.