There are different approaches to text categorization. The most popular one is based on keyword matching. Here at Bitext, however, we use a different approach that sets us apart: one based on linguistics.
Both can deliver results, and keyword-based categorization systems usually let you refine the rules with the familiar Boolean queries. Why, then, did we take a different and less well-established path? To deliver the highest possible accuracy. Linguistics is the approach to use when you want to easily disambiguate everyday occurrences, such as the word “can” in “I want a can of soda” versus “I can do anything”.
With the keyword matching approach, we can set a rule saying that when the words “car” and “like” both appear in a phrase, it should be categorized as “customer satisfaction”.
For the phrase “I really like my new car, it’s a hybrid one”, this rule categorizes correctly. However, the phrase “My neighbor’s car is like mine” also contains both words, so the tool will put it in the same category as the first one, and this will be wrong.
The problem with keyword matching is that we have a clear aim when we set up the rules, but we think only in terms of words and not of what lies behind each word.
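The failure described above can be reproduced with a minimal sketch of a keyword rule. The function names `tokenize` and `keyword_rule` are hypothetical and only illustrate the idea; they are not part of any Bitext tool.

```python
import re

def tokenize(phrase):
    # Lowercase and keep only alphabetic runs, so "car," still matches "car".
    return re.findall(r"[a-z]+", phrase.lower())

def keyword_rule(phrase, keywords=("car", "like")):
    # Hypothetical rule: fires when every keyword occurs somewhere in the phrase.
    tokens = set(tokenize(phrase))
    return all(k in tokens for k in keywords)

phrases = [
    "I really like my new car, it's a hybrid one",
    "My neighbor's car is like mine",
]
results = [keyword_rule(p) for p in phrases]
print(results)  # [True, True]
```

Both phrases trigger the rule, although only the first one expresses customer satisfaction.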
The linguistic approach, on the other hand, uses part-of-speech (POS) tagging when setting the rules for text categorization.
In our platform we differentiate between four types of POSes:
- sust (for nouns)
- adj (for adjectives)
- verb (for verbs)
- adv (for adverbs)
When the time comes to write the rules, we simply say: include in the “customer satisfaction” category only those phrases where “car” is used as a noun (sust) and “like” as a verb. Going back to the previous example, in “My neighbor’s car is like mine” the word “like” is not a verb, so the phrase won’t be included in the category.
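As a rough illustration, the rule just described could be checked like this. The sketch assumes the phrases have already been tagged; the tags below are assigned by hand, the tag "other" is a placeholder introduced here for anything outside the four types, and `matches_category` is a hypothetical helper rather than the platform's actual implementation.

```python
# Hand-tagged (word, POS) pairs; a real system would get these from a POS tagger.
# "other" is an assumed placeholder for words outside sust/adj/verb/adv.
liked_car = [
    ("i", "other"), ("really", "adv"), ("like", "verb"), ("my", "other"),
    ("new", "adj"), ("car", "sust"), ("it", "other"), ("is", "verb"),
    ("a", "other"), ("hybrid", "adj"), ("one", "other"),
]
neighbor_car = [
    ("my", "other"), ("neighbor's", "sust"), ("car", "sust"),
    ("is", "verb"), ("like", "other"), ("mine", "other"),
]

# The category rule: every (word, POS) pair must occur in the phrase.
CUSTOMER_SATISFACTION = [("car", "sust"), ("like", "verb")]

def matches_category(tagged_phrase, rule):
    present = set(tagged_phrase)
    return all(pair in present for pair in rule)

first = matches_category(liked_car, CUSTOMER_SATISFACTION)
second = matches_category(neighbor_car, CUSTOMER_SATISFACTION)
print(first, second)  # True False
```

Because the rule constrains the part of speech as well as the word, the second phrase no longer lands in the category.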
But how do we use the POSes?
Let’s start with an easy example and take the word “can” and the phrases “I want a can of soda” and “I can do anything”.
If we are interested in the word as a verb, we should create the rule can;verb, which will extract all the phrases in which “can” acts as a verb.
Conversely, if you are only looking for phrases in which “can” is used as a noun, you will have to write can;sust.
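To make the word;POS notation concrete, here is a small sketch that parses such a rule and applies it to the two example phrases. Only the rule syntax comes from the text above; `parse_rule`, `extract`, the hand-assigned tags, and the "other" placeholder tag are assumptions made for illustration.

```python
def parse_rule(rule):
    # Split "can;verb" into the pair ("can", "verb").
    word, pos = rule.split(";")
    return word.strip().lower(), pos.strip().lower()

def extract(rule, tagged_phrases):
    # Return the text of every phrase containing the word with the requested POS.
    target = parse_rule(rule)
    return [text for text, tags in tagged_phrases if target in tags]

# Tags are assigned by hand here; in practice a POS tagger would supply them.
tagged_phrases = [
    ("I want a can of soda",
     [("i", "other"), ("want", "verb"), ("a", "other"),
      ("can", "sust"), ("of", "other"), ("soda", "sust")]),
    ("I can do anything",
     [("i", "other"), ("can", "verb"), ("do", "verb"), ("anything", "other")]),
]

verb_hits = extract("can;verb", tagged_phrases)
noun_hits = extract("can;sust", tagged_phrases)
print(verb_hits)  # ['I can do anything']
print(noun_hits)  # ['I want a can of soda']
```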