A client came to us because their bot couldn't understand customer requests like: "I've been a client for five years and my daughter wanted that present so badly; but I haven't received it yet!" It was impossible for their bot to understand such a combination of knowledge items. That's how we started implementing query segmentation.

Segmentation is key for bot understanding. Human speech more often than not shows few signs of clear structure, and an obvious separation between phrases is, at times, missing.
That makes it very difficult for bots to comprehend what is being said and, more importantly, to transform those words into definite requests. Segmentation is simply splitting what people say into shorter sentences that a bot can handle.
However, even though segmentation is crucial for bot understanding, it is far from trivial to implement. Understanding how people speak and what they expect from their words is not an easy task. For example, how can we teach a computer to distinguish between a period that ends a sentence and one that belongs to an abbreviation?
The kind of problems segmentation overcomes:
- Identifying the end of a sentence
- Recognizing that each language has its own punctuation rules for the same symbols
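The abbreviation problem mentioned above can be illustrated with a minimal sketch (our own simplified example, not Bitext's implementation): a tiny whitelist of abbreviations keeps a splitter from treating every period as a sentence end.

```python
# Hypothetical sketch: distinguish abbreviation periods from
# sentence-ending periods with a small whitelist.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_with_abbrevs(text):
    """Split text into sentences, ignoring periods in known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ending in "." closes a sentence only if it is not
        # a known abbreviation.
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_with_abbrevs("Dr. Smith arrived. He was late."))
# → ['Dr. Smith arrived.', 'He was late.']
```

A real segmenter would also need to handle abbreviations at true sentence ends ("...he left at 5 p.m. Then..."), which a plain whitelist cannot resolve on its own.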
In addition to the previously mentioned issue, the same marks follow different punctuation rules in each language. For example, in English we don't write a dot after an ordinal number ("February 3rd"), but German speakers do ("3. Februar"). Spanish has an opening inverted question mark as well as the closing one ("¿Quién viene?"), and in Arabic the question mark is mirrored ("؟") and, since the script runs right to left, appears at the left end of the sentence ("من هو الشخص الذي يأتي؟"). So complete and meaningful segmentation must be language sensitive.
In fact, if we go into even more detail, not even the symbols themselves are the same across all writing systems. This is how dots look:
- Latin dot: “.”
- Korean dot: “·”
- Chinese dot: “。”
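The variety of sentence-ending marks above can be captured in a minimal sketch (an illustrative toy, nothing like a production segmenter): a splitter whose terminator set includes Latin, ideographic, and Arabic marks.

```python
import re

# Hypothetical sketch: a naive splitter whose sentence-ending character
# class covers several scripts (Latin ".", ideographic "。", Arabic "؟").
# Real segmentation needs far more context; this only shows that the
# terminator set itself varies by script.
SENTENCE = re.compile(r"[^.。؟!?]+[.。؟!?]?")

def naive_split(text):
    """Return non-empty chunks ending at any script's terminator."""
    return [chunk.strip() for chunk in SENTENCE.findall(text) if chunk.strip()]

print(naive_split("今天下雨。明天晴天。"))
# → ['今天下雨。', '明天晴天。']
```

Note that this handles only the symbols; the language-specific *rules* for those symbols (like German ordinals) still need per-language logic.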
How to solve all these problems:
The linguistic approach is the right one, even for a seemingly simple task like segmentation. At Bitext we take all these specifics into consideration and have developed a segmentation API that is sensitive to each language's punctuation rules and marks.
Take a paragraph like "Please read the story (You'll be amazed) After reading it, please proceed." A simple segmentation tool wouldn't identify the brackets as marking a sentence boundary, and would return the text unsplit:
"Please read the story (You'll be amazed) After reading it, please proceed."
Instead, Bitext's Segmentation service treats the closing bracket as the end of a sentence, and therefore returns a much more useful analysis:
"Please read the story",
"(You'll be amazed)",
"After reading it, please proceed."
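The idea behind the bracket rule can be sketched in a few lines (our own simplified illustration, not Bitext's actual logic): treat a parenthesized aside as its own segment when no period separates it from the surrounding text.

```python
import re

def bracket_aware_split(text):
    """Hypothetical sketch: split out a parenthetical as its own segment.

    re.split with a capturing group keeps the matched parenthetical
    in the result list instead of discarding it.
    """
    parts = re.split(r"(\([^)]*\))", text)
    return [part.strip() for part in parts if part.strip()]

print(bracket_aware_split(
    "Please read the story (You'll be amazed) After reading it, please proceed."
))
# → ['Please read the story', "(You'll be amazed)",
#    'After reading it, please proceed.']
```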
Or take this example in German: "Richard Wagner wurde am 22. Mai 1813 geboren." ("Richard Wagner was born on 22 May 1813.") If the segmenter is language-agnostic, it will produce the following analysis:
"Richard Wagner wurde am 22.",
"Mai 1813 geboren."
However, our tool still works well for German because it uses a dedicated regex (regular expression) file for each language. If you specify the language, this is the end result:
"Richard Wagner wurde am 22. Mai 1813 geboren."
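A per-language rule of this kind can be sketched in miniature (an assumed rule for illustration, not Bitext's actual regex file): in German, a period directly after a digit marks an ordinal ("22. Mai"), so the splitter breaks on whitespace after a period only when the period does not follow a digit.

```python
import re

# Hypothetical German rule: split on whitespace preceded by a period,
# unless that period follows a digit (an ordinal like "22.").
# Both lookbehinds are fixed-width, as Python's re module requires.
GERMAN_BOUNDARY = re.compile(r"(?<=\.)(?<!\d\.)\s+")

def split_de(text):
    """Split German text at sentence ends, keeping ordinals intact."""
    return GERMAN_BOUNDARY.split(text)

print(split_de("Richard Wagner wurde am 22. Mai 1813 geboren. Er starb 1883."))
# → ['Richard Wagner wurde am 22. Mai 1813 geboren.', 'Er starb 1883.']
```

Swapping in a different rule file per language keeps the core splitter unchanged while the boundary patterns vary.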
Don't underestimate the power of a good segmentation tool. Being able to split natural language into units (and doing it like a pro!) is crucial for bots that transform our words into requests. What's more, segmentation underpins other very demanding tasks like sentiment analysis, summarization, and machine translation.
You can try Bitext's segmentation service with your own texts by simply registering on our API platform.