In some of our recent talks, colleagues have asked us about the Stanford parser and how it compared to Bitext technology (namely at our last workshop on Semantic Analysis of Big Data in San Francisco, and in our presentation in the Semantic Garage also in San Francisco).
We have revisited this parser and, as expected, results were impressive. Sentences taken from news sources get parsed elegantly and with a nice dependency tree for their constituents. The parser is based on a probabilistic approach, i.e., it needs to be trained using a hand-tagged corpus with POS (Part of Speech) tags; typically, the Wall Street Journal corpus from Penn Treebank. The question arises whether this approach is effective when a different type of text needs to be addressed (such as Social Media content, User Generated Reviews…). In this case, it is necessary to hand-tag new corpora for training the parser. Let's think about parsing tweets, which often feature ungrammatical sentences, slang, abbreviations, emoticons... and if this needs to be done in languages other than English, hand-tagging corpora for training a parser can be a hurdle for fully automating tasks like Social Media Analysis.
In Bitext, we have had to face this situation since our customers deal with many different types of data and in multiple languages. We follow a linguistic approach whereby we can quickly develop grammars which describe the dependency structure of sentences at various level of detail, depending on whether we analyze Social Media content or News texts. Our linguistic parser is grammar-independent and dictionary-independent, and for a new language we can use the same parser just changing the linguistic data sources and, most importantly, we do not require hand-coded corpora to “train” our parser since it does not require any training. This is the downside; on the upside, the probabilistic information associated to rules can prove very useful. We are looking into ways to incorporate statistical information in our grammars so that the parser can select the correct analysis in cases where structural ambiguity is pervasive. A hybrid approach combining linguistic knowledge and statistical information could help in solving these sentences, which are more frequent than we usually think.