Has your name or surname ever been mocked? In many languages, people are embarrassed to have unusual names that make others laugh. In some cases, names and surnames are even treated as insulting or offensive.
Some users have been denied registration on websites because their names were identified as offensive words. Natalie Weiner, a writer and editor from New York, tweeted about the moment she tried to create an account on a website and her last name was deemed offensive. Plenty of people replied to her tweet with similar stories and sympathy. Mike Dickman and James Butts, among others, sympathized with Natalie because they knew that situation first-hand.
The same happened to Bernhard Dick when he tried to register on the website of a software company. As you can see in the image below, when he entered his surname, the following warning message appeared:
There is no doubt that profanity is prohibited on many websites, but something is failing, at least in cases like those mentioned above. Detecting offensive words in registration forms or in messages posted online usually relies on keyword-matching techniques. Generally, websites use supervised ML-based classification methods that learn the target pattern from labeled training data.
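To make the failure concrete, here is a minimal sketch of a naive keyword-matching filter. The blocklist and the `is_flagged` function are hypothetical and chosen only for illustration. They are not taken from any of the websites mentioned above.

```python
# Hypothetical blocklist for illustration; real filters use far longer lists.
BLOCKLIST = {"dick", "butt", "weiner"}


def is_flagged(text: str) -> bool:
    """Return True if any blocklisted term occurs anywhere in the text."""
    lowered = text.lower()
    # Plain substring matching: no notion of context, word class, or field type.
    return any(term in lowered for term in BLOCKLIST)


for surname in ["Weiner", "Dickman", "Butts", "Smith"]:
    print(surname, is_flagged(surname))
```

Because the check looks only at character sequences, it rejects the surnames from the stories above just as readily as a genuine insult.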
Nevertheless, such procedures are not suitable for accurate offensive-language detection because they do not take into account, for instance, the context in which a word appears. Likewise, they ignore linguistic characteristics and sentence structure that could help avoid mistaking a name for an insult.
A tidy solution here would be to apply an anonymization method to previously recognized entities. Data anonymization is the process of detecting and removing sensitive data from a document while keeping its original format.
The Bitext anonymization services, including those for GDPR, are based on an entity extraction technique in which each entity is classified according to its grammatical attributes and its position in a sentence. A tool extracts the relevant named entities (personal names, places, companies, addresses, dates, phone numbers, etc.) and offensive language (insulting, rude, vulgar, or swear words) and replaces them with an expression.
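The idea of replacing detected entities while keeping the rest of the document intact can be sketched as follows. This is not the Bitext implementation: it swaps in simple regular expressions for the grammar-based entity extraction described above, and the patterns, the bracketed labels, and the `anonymize` function are all assumptions made for illustration.

```python
import re

# Illustrative patterns only (assumptions); a production system would rely on
# linguistic entity extraction rather than regular expressions.
# DATE is applied before PHONE so that a year is never absorbed into a number.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(text: str) -> str:
    """Replace each matched entity with a bracketed label, leaving other text as is."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


message = "Contact Jane at jane.doe@example.com or +1 555-123-4567 before 12/03/2018."
print(anonymize(message))
```

Note that the personal name in the sample message is left untouched. Pattern matching alone cannot tell a name from an ordinary word, which is why entity recognition based on linguistic attributes is needed for that class of entities.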
If performed correctly, anonymization is definitely the best method to ensure the safety of data collected online. To perform it correctly, a system must consider not only the spelling of a word but also the linguistic attributes that distinguish it from its homographs, which avoids blunders like the ones described above.
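As a rough sketch of how linguistic attributes can separate a surname from its offensive homograph, the example below flags a blocklisted word only when it is not tagged as a proper noun. The `Token` class, the tag names, and the hand-assigned tags are hypothetical. In practice the tags would come from a linguistic analyzer, which is not shown here.

```python
from dataclasses import dataclass

# Hypothetical blocklist for illustration.
BLOCKLIST = {"dick", "butt", "weiner"}


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag, e.g. "PROPN" (proper noun) or "NOUN"


def offensive_tokens(tokens: list[Token]) -> list[str]:
    """Return blocklisted words, skipping any token tagged as a proper noun."""
    return [
        t.text
        for t in tokens
        if t.text.lower() in BLOCKLIST and t.pos != "PROPN"
    ]


# Tags are assigned by hand here; an analyzer would normally supply them.
name = [Token("Bernhard", "PROPN"), Token("Dick", "PROPN")]
insult = [Token("what", "PRON"), Token("a", "DET"), Token("dick", "NOUN")]

print(offensive_tokens(name))
print(offensive_tokens(insult))
```

The spelling is the same in both inputs. Only the grammatical attribute decides whether the word is treated as a name or as profanity.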