The Nvidia team's blog could not be clearer about the challenges of AI and NLP: “Recent work has demonstrated that larger language models dramatically advance the state of the art in natural language processing (NLP) applications such as question-answering, dialog systems, summarization, and article completion. However, during training, large models do not fit in the available memory of a single accelerator, requiring model parallelism to split the parameters across multiple accelerators.” For details, see "State-of-the-Art Language Modeling Using Megatron on the NVIDIA A100 GPU": https://devblogs.nvidia.com/language-modeling-using-megatron-a100-gpu
In short: language models need to grow in size to increase their accuracy, but hardware limits stand in the way.
This is particularly clear in the world of chatbots and assistants, where training data is typically limited to 100 to 500 utterances per intent. Training with only a few hundred utterances is probably the main obstacle to improving accuracy.
If the Nvidia team is right, and its point looks straightforward to us, the next challenge is: where can we get the amounts of training data needed to build larger language models? In our view, the current approach of producing training data manually cannot reach the scale that high accuracy requires. Synthetic data (or artificial data) is probably the answer to this need: natural language generation (NLG) technology can provide training data at the scale needed by today’s automation industry, across different languages, verticals, language registers (colloquial/formal) and regional variants (UK/US English). For details, see: https://www.bitext.com/training-data-for-customer-support-automation/
Antonio Valderrábanos, Ph.D.
CEO & Founder of Bitext