Last week we introduced a controversial topic: how do you achieve higher accuracy in text categorization? We explained the two most popular approaches to this challenge, namely keyword matching and linguistics.
However, although we explained how both of them try to solve the disambiguation problem, we focused only on simple phrases, which don’t fully illustrate the real differences. That is why in today’s post we want to delve more deeply into the topic.
Keyword matching approach:
When it comes to analyzing or categorizing complex phrases, this approach employs Boolean searches as its underlying technology. In the past, this approach imposed some restrictions in terms of the number of keywords you could include in your search. However, nowadays the Boolean approach has become considerably more flexible, and offers some advantages in terms of customization.
First, you need to select the keywords you are interested in and then add the Boolean operators that will adapt the search to your needs. The most frequent ones are AND, OR and NOT. You can also include modifiers such as the asterisk *, parentheses () and quotation marks “”.
For example, a query that combines the keywords “hotel” and “bed” with the Boolean OR will return any text containing at least one of the two:
- “I really enjoyed my hotel in New York”
- “I feel very lazy to leave my bed”
- “I am writing this review from my bed: I arrived at the hotel August 11th…”
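To make the OR behavior concrete, here is a minimal sketch in Python. It is only an illustration of whole-word Boolean matching, not the implementation of any particular search engine, and the helper names (`tokenize`, `matches_or`, `matches_and`) are made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def matches_or(text, keywords):
    """True if the text contains at least one of the keywords as a whole word."""
    tokens = set(tokenize(text))
    return any(k.lower() in tokens for k in keywords)

def matches_and(text, keywords):
    """True if the text contains every keyword as a whole word."""
    tokens = set(tokenize(text))
    return all(k.lower() in tokens for k in keywords)

reviews = [
    "I really enjoyed my hotel in New York",
    "I feel very lazy to leave my bed",
    "I am writing this review from my bed: I arrived at the hotel August 11th…",
]

# hotel OR bed: every review mentions at least one of the two keywords.
or_hits = [r for r in reviews if matches_or(r, ["hotel", "bed"])]

# hotel AND bed: only the last review mentions both.
and_hits = [r for r in reviews if matches_and(r, ["hotel", "bed"])]
```

Swapping OR for AND narrows the result to the single review that mentions both keywords, which is the kind of customization the operators give you.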
For more complex searches we can use the NEAR operator, which allows us to control the distance between the keywords. For example, “bed NEAR/2 hotel” means “bed” has to appear within 2 words of “hotel”; otherwise, the text won’t be considered a match.
If we consider the last example sentence, we can see that it won’t be included in the results, since “hotel” and “bed” are separated by more than 2 words.
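A proximity check like NEAR can be sketched as follows. This is an assumption-laden illustration: the function name `near` is hypothetical, distance is measured as the difference between token positions (so adjacent words are at distance 1), and word order is ignored. Real search engines may count distance differently.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, first, second, max_distance):
    """True if `first` and `second` occur within `max_distance` token
    positions of each other, in either order."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

# The last example sentence from the list above: "bed" and "hotel" are too far apart.
far_apart = near(
    "I am writing this review from my bed: I arrived at the hotel August 11th…",
    "bed", "hotel", 2,
)

# A made-up sentence where the two keywords are adjacent.
close_together = near("The hotel bed was comfortable", "bed", "hotel", 2)
```

Here `far_apart` is false and `close_together` is true, mirroring how “bed NEAR/2 hotel” filters the results.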
As you can see, this Boolean approach allows the user to be flexible when writing the rules. However, once you start writing complex rules with the NEAR operator, the results are often not very accurate.
Why? The answer is simple: because we are not considering linguistics. To improve accuracy, we need to consider the nature of each word; otherwise, the results will include matches you are not looking for.
Linguistics approach:
In the linguistics approach, the Boolean approach is discarded, since it introduces a lot of noise. What does this mean? Linguistics allows you to obtain more accurate results because the results only contain the matches you are interested in.
How is this done? By using POS tagging and lemmatization software.
In case you didn’t read our previous article, taking advantage of POS tagging allows us to write rules that will only match the specified words depending on their use.
- Can;sust → we are only interested in text containing the word “can” as a noun, not as a verb.
- Parking;verb → we are only interested in the word “parking” when it acts as a verb.
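The idea behind such word;tag rules can be sketched in Python. Everything here is an assumption made for illustration: the sentences are invented, the tags are written by hand instead of coming from a real POS tagger, the tag names (`noun`, `verb`, and so on) are placeholders, and `parse_rule`, `rule_matches` and `select` are hypothetical helpers, not part of any product.

```python
# Hand-tagged toy sentences: each token is a (word, tag) pair.
# The tags are written by hand for illustration; a real system would
# obtain them from a POS tagger.
TAGGED = {
    "Open the can of soup": [
        ("open", "verb"), ("the", "det"), ("can", "noun"),
        ("of", "prep"), ("soup", "noun"),
    ],
    "I can swim": [("i", "pron"), ("can", "verb"), ("swim", "verb")],
    "She is parking the car": [
        ("she", "pron"), ("is", "verb"), ("parking", "verb"),
        ("the", "det"), ("car", "noun"),
    ],
    "The parking is free": [
        ("the", "det"), ("parking", "noun"), ("is", "verb"), ("free", "adj"),
    ],
}

def parse_rule(rule):
    """Split a 'word;tag' rule into a lowercase (word, tag) pair."""
    word, tag = rule.split(";")
    return word.strip().lower(), tag.strip().lower()

def rule_matches(rule, tagged_tokens):
    """True if some token carries both the rule's word and the rule's tag."""
    return parse_rule(rule) in tagged_tokens

def select(rule):
    """Return the sentences whose tagged tokens satisfy the rule."""
    return [s for s, tokens in TAGGED.items() if rule_matches(rule, tokens)]

noun_can = select("can;noun")          # only the soup sentence
verb_parking = select("Parking;verb")  # only the sentence where someone is parking
```

A plain keyword search for “can” or “parking” would return both sentences in each pair; the tag is what separates them.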
Using these linguistic rules, you can achieve a deeper level of detail than the one achievable with Booleans alone. They will help you leverage every piece of information that you were able to extract with AND, OR and NOT. And the good news is that you can even forget about NEAR, as it is handled by our parsing engine.
Let’s see some examples. Book AND online will categorize all of these:
- “I always book online.”
- “I am reading my book online”
- “I lost my book and now I cannot even find it online.”
- “I book over the phone and only go online to check prices.”
But book;verb,online;adv will not. Instead, it will match only this one:
- “I always book online.”
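The contrast between the two rules can be sketched in Python. This is a toy illustration under loud assumptions: the annotations are written by hand and cover only the two words the rule cares about, and we assume the parser links “online” to the verb it modifies, which is our guess at how a parsing engine could replace NEAR. The function names are hypothetical.

```python
import re

# Hand-written annotations: (word, part of speech, word it attaches to).
# Only the tokens relevant to the rule are listed; a real parser would
# produce this information for every token.
ANNOTATED = {
    "I always book online.": [
        ("book", "verb", None), ("online", "adv", "book"),
    ],
    "I am reading my book online": [
        ("book", "noun", "reading"), ("online", "adv", "reading"),
    ],
    "I lost my book and now I cannot even find it online.": [
        ("book", "noun", "lost"), ("online", "adv", "find"),
    ],
    "I book over the phone and only go online to check prices.": [
        ("book", "verb", None), ("online", "adv", "go"),
    ],
}

def boolean_and(sentence, words):
    """Keyword rule: every word appears somewhere in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return all(w in tokens for w in words)

def linguistic_rule(tokens, head, head_pos, dependent, dependent_pos):
    """Linguistic rule: `head` has the right tag and `dependent`, with the
    right tag, attaches to it."""
    has_head = any(w == head and p == head_pos for w, p, _ in tokens)
    has_dependent = any(
        w == dependent and p == dependent_pos and h == head
        for w, p, h in tokens
    )
    return has_head and has_dependent

keyword_hits = [s for s in ANNOTATED if boolean_and(s, ["book", "online"])]
linguistic_hits = [
    s for s, tokens in ANNOTATED.items()
    if linguistic_rule(tokens, "book", "verb", "online", "adv")
]
```

Under these assumptions, `keyword_hits` contains all four sentences, while `linguistic_hits` contains only “I always book online.”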
It is not magic; it is just a matter of good parsing!