The Berkeley Parser
Parsing is the task of analyzing the grammatical structure of natural language. Given a sequence of words, a parser forms units like subject, verb, object and determines the relations between these units according to some grammar formalism. Our work has focused on learning probabilistic context-free grammars (PCFGs) which assign a sequence of words the most likely parse tree. For example:
|
The Golden Gate bridge is red. |
|
If you haven't already tried the parsing demo from the demo bar above, now would be a good time. You can type your own sentence or just click on any sentence on the webpage and our parser will compute and render the most likely parse tree for you. The Berkeley Parser is the most accurate and one of the fastest parsers for a variety of languages. It is available for download at the bottom of this page.
Learning Grammars
Parsers are typically trained on a standardized collection of hand parsed sentences. While a variety of techniques have been developed to learn accurate grammars from those examples, most of them require additional, linguistically motivated annotations. In contrast, we follow an entirely data-driven approach, which does not require any additional human input. Beginning with the barest possible initial structure, we automatically refine the grammar, introducing more complexity only where needed.
Even though there is no human input, because of the way the latent structure is induced, our method reliably learns subcategories which exhibit most of the linguistically motivated splits discussed by previous work, and also captures other, additional linguistic phenomena. As an example of how our staged training proceeds we show the three most likely words for the learned subcategories of the determiner (DT) tag below. In the first step demonstratives are split from determiners, then quantificational elements are split from demonstratives along one branch and definites from indefinites along the other, forming finer and finer syntactico-semantically coherent subcategories. Besides being automatically learned and linguistically interpretable, our models are very sparse. Our largest model is an order of magnitude smaller than the previous state of the art model for parsing English and yet our model has a superior parsing accuracy.
Inference
Once we have learned a grammar for a language, we must select a parse for a given sentence. Even though our grammars are compact compared to other high accuracy grammars, we need to constrain the search space of possible parses in some way. Our parser implements a hierarchical coarse-to-fine scheme which allows us to find the best parse tree very efficiently. We construct a sequence of increasingly refined grammars by reversing the splits which were induced during training. We then parse with each refinement and use its predictions to limit the space of candidate parses, which will be considered by the finer grammars. Below you can see an example of the possible parses after each refinement (black=high probability). One can clearly see how the parser narrows down to its final prediction.
The sentence used to generate this example is:
Influental members of the House Ways and Means Committee introduced legislation that would restrict how the new s&l bailout agency can raise capital, creating another potential obstacle to the government's sale of sick thrifts.
You can use the demo above to see the final parse of the sentence.
Software
The parser is written in Java. You can download a jar-archive containing the source code and a compiled version. We also provide some trained grammars for different languages. The current release includes grammars for parsing English, Chinese, French and German.
The parser is licensed under GNU GPL. Note that this is the full GPL - which allows its use for research purposes or other free software projects but does not allow its incorporation into any type of commercial software, even in part or in translation.
Please download the files from the Google Code repository for the BerkeleyParser:
http://code.google.com/p/berkeleyparser/
The old (and now obsolete) release is available here.
Publications
!IS_A_LIST Two Languages are Better than One (for Syntactic Parsing), David Burkett and Dan Klein, In proceedings of EMNLP 2008. [pdf]
!IS_A_LIST Discriminative Log-Linear Grammars with Latent Variables, Slav Petrov and Dan Klein, In proceedings of NIPS 2008. [pdf] [slides] [bib]
!IS_A_LIST The Infinite PCFG using Hierarchical Dirichlet Processes, Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein, In proceedings of EMNLP 2007. [pdf] [slides]
!IS_A_LIST Learning and Inference for Hierarchically Split PCFGs, Slav Petrov and Dan Klein, In proceedings of AAAI (Nectar Track) 2007. [pdf] [slides] [bib]
!IS_A_LIST Improved Inference for Unlexicalized Parsing, Slav Petrov and Dan Klein, In proceedings of HLT-NAACL 2007. [pdf] [slides] [bib]
!IS_A_LIST Learning Accurate, Compact, and Interpretable Tree Annotation, Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein, In proceedings of COLING-ACL 2006. [pdf] [slides] [bib]
!IS_A_LIST Non-Local Modeling with a Mixture of PCFGs, Slav Petrov, Leon Barrett, and Dan Klein, In proceedings of CoNLL 2006. [pdf] [slides] [bib]
!IS_A_LIST Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency, Dan Klein and Chris Manning, In Proceedings of the Association for Computational Linguistics (ACL) 2004. [pdf]
!IS_A_LIST A Generative Constituent-Context Model for Improved Grammar Induction, Dan Klein and Chris Manning, In Proceedings of the Association for Computational Linguistics (ACL) 2002. [pdf]
!IS_A_LIST Natural Language Grammar Induction Using a Constituent-Context Model, Dan Klein and Chris Manning, In Advances in Neural Information Processing Systems (NIPS) 2001. [pdf]
!IS_A_LIST Distributional Phrase Structure Induction, Dan Klein and Chris Manning, In Proceedings of the Conference on Natural Language Learning (CoNLL) 2001. [pdf]
!IS_A_LIST Accurate Unlexicalized Parsing, Dan Klein and Chris Manning, In Proceedings of the Association for Computational Linguistics (ACL) 2003. [pdf]
!IS_A_LIST Factored A* Search for Models over Sequences and Trees, Dan Klein and Chris Manning, In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) 2003. [pdf]
!IS_A_LIST A* Parsing: Fast Exact Viterbi Parse Selection, Dan Klein and Chris Manning, In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2003. [pdf]
!IS_A_LIST Parsing and Hypergraphs, Dan Klein and Chris Manning, Bunt, Carroll, and Satta, eds. , New Developments in Parsing Technology, Kluwer Academic Publishers 2002.
!IS_A_LIST Parsing and Hypergraphs, Dan Klein and Chris Manning, In Proceedings of the International Workshop on Parsing Technologies (IWPT) 2001. [pdf]
!IS_A_LIST Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank, Dan Klein and Chris Manning, In Proceedings of the Association for Computational Linguistics (ACL) 2001. [pdf]
!IS_A_LIST An O(n^3) Agenda-Based Chart Parser for Arbitrary Probabilistic Context-Free Grammars, Dan Klein and Chris Manning, Stanford Technical Report 2001. [pdf]
!IS_A_LIST Max-Margin Parsing, Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Chris Manning, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2004. [pdf]
!IS_A_LIST Fast Exact Inference with a Factored Model for Natural Language Processing, Dan Klein and Chris Manning, In Advances in Neural Information Processing Systems 15 (NIPS) 2002. [pdf]

