It is well known that Germans love their long words, but those words are far less loved by text-processing pipelines. Few Python NLP libraries are adapted to German, which makes compounds hard to analyze properly. Let us share our NLP tool for splitting word compounds.
A compounding language is one that forms new words by joining existing words, one after another, into a single unit; German, Dutch, Korean, Norwegian and Swedish are examples. Such languages pose a challenge for language-processing tools, since most tools rely on limited lexicons that cannot cover every possible inflection or model how compounds are formed. As a result, splitting these words is a crucial pre-processing step for high-quality lemmatization in these languages.
Accordingly, a decompounding tool is critical for search applications, where decompounding before indexing and searching can significantly increase recall: decomposing compounds increases lexicon coverage and reduces out-of-vocabulary terms. At Bitext, we developed a lexical analyzer that offers extensive support for languages such as German (including the Swiss variant), Dutch (including Belgian Dutch), Korean, Norwegian Bokmål, Norwegian Nynorsk and Swedish.
In German, for example, in order to split the word Abwasserbehandlungsanlagen, which means sewage treatment plants, our tool reverses the German rules for compounding and breaks the word into its basic components, Abwasser+Behandlung+s+Anlagen. Additionally, the tool can lemmatize the whole compound, Abwasserbehandlungsanlage, or each of its components, Abwasser + Behandlung + Anlage.
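The general idea behind this kind of splitting can be illustrated with a minimal dictionary-based sketch. The tiny lexicon and the handling of the linking element *s* below are our own simplification, not Bitext's actual method, which handles the full set of German compounding rules:

```python
# Minimal sketch of dictionary-based decompounding with a hand-made
# lexicon mapping lowercased forms to lemmas. Illustrative only.
LEXICON = {"abwasser": "Abwasser", "behandlung": "Behandlung", "anlagen": "Anlage"}
LINKING_ELEMENTS = ("s", "es")  # common German linking elements ("Fugenelemente")

def decompound(word):
    """Return the list of component lemmas covering `word`, or None."""
    word = word.lower()
    if word in LEXICON:
        return [LEXICON[word]]
    # Try every split point, longest first component first.
    for i in range(len(word) - 1, 2, -1):
        head, rest = word[:i], word[i:]
        if head in LEXICON:
            # The remainder may begin with a linking element.
            candidates = [rest] + [rest[len(le):] for le in LINKING_ELEMENTS
                                   if rest.startswith(le)]
            for cand in candidates:
                tail = decompound(cand)
                if tail:
                    return [LEXICON[head]] + tail
    return None

print(decompound("Abwasserbehandlungsanlagen"))
# → ['Abwasser', 'Behandlung', 'Anlage']
```

A real decompounder must also rank competing splits and cover prefixes, suffixes and spelling changes, which is where a full lexical analyzer earns its keep.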
Consider a search application: if several documents contain the word Abwasserbehandlungsanlage and a user searches for Abwasserbehandlung (sewage treatment), those documents will not be returned. If, however, the compound is broken into the lemmas Abwasser, Behandlung and Anlage during indexing, then searching for Abwasserbehandlung will return the right results.
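The recall gain can be shown with a toy inverted index. The hand-made `SPLITS` table below stands in for a real decompounder and is purely illustrative:

```python
# Toy example: indexing compound parts so that a query for a
# sub-compound still matches. SPLITS is a stand-in for a decompounder.
SPLITS = {
    "abwasserbehandlungsanlage": ["abwasser", "behandlung", "anlage"],
    "abwasserbehandlung": ["abwasser", "behandlung"],
}

def expand(token):
    """Return the token plus its compound parts, if any."""
    return [token] + SPLITS.get(token, [])

# Index a document containing the full compound.
index = set(expand("abwasserbehandlungsanlage"))

# Without decompounding, the query as a whole misses the document...
assert "abwasserbehandlung" not in index
# ...but if the query is decompounded too, its parts all match.
query_parts = SPLITS["abwasserbehandlung"]
assert all(part in index for part in query_parts)
```

The same expansion is applied at query time, so compound queries and compound documents meet on their shared component lemmas.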
Moreover, the Bitext Lexical Analyzer is highly configurable to suit the target application. Its main configurable aspects are:
- Lexicalized compounds: by default, the decompounder will not split compounds such as Wörterbuch (dictionary) or Rindfleisch (beef), which are extremely common and have become lexicalized (they appear in dictionaries). However, in some applications, it may be useful to split these compounds, so our decompounder can optionally do this.
- Case sensitivity: by default, the decompounder enforces case sensitivity (since German nouns are capitalized), but this may be disabled in cases where the input text is from an informal source.
- Alternative spellings: in German, the digraph ss and the letter ß are interchangeable in some contexts but not in others. By default, the Lexical Analyzer will return the lemma using the same spelling present in the form (except for cases where this is not valid, such as ißt → essen). This can be configured as needed, such as for dialects like Swiss German, which entirely eliminates the use of ß.
- Compounds in the different language variants: for example, the tool recognizes the Swiss German word Chuchichäschtli (kitchen cupboard), which is not used in Standard German.
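The options above could be gathered into a single configuration object. The sketch below is hypothetical; the names are ours, not Bitext's actual API:

```python
# Hypothetical configuration mirroring the options described above;
# field names are illustrative, not Bitext's real interface.
from dataclasses import dataclass

@dataclass
class DecompounderConfig:
    split_lexicalized: bool = False  # also split Wörterbuch, Rindfleisch, ...
    case_sensitive: bool = True      # German nouns are capitalized
    preserve_spelling: bool = True   # keep ss/ß as written where valid
    variant: str = "de-DE"           # e.g. "de-CH" for Swiss German

# A setup suited to informal Swiss German input:
informal_swiss = DecompounderConfig(case_sensitive=False, variant="de-CH")
```

Keeping such options explicit makes it easy to tune the same analyzer for indexing, lemmatization or dialect-specific processing.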
Improve your solutions by adding NLP tools based on linguistic analysis to get the results you expect. Why not see it for yourself? Try our decompounding tool in our API.
Are you missing a language above? We love challenges: just let us know, and your wish may come true.