
Machine Translation Projects

We work on a variety of sub-problems related to machine translation, including word alignment, translation model estimation, decoding, and synchronous parsing. Currently, we are also building an in-house end-to-end syntactic translation system.

Translation Model Estimation

Applying discriminative training techniques for structured output domains, we investigated using the perceptron algorithm to learn millions of feature parameters in a translation system, including translation table probabilities and hyperparameters.

Machine translation does not have traditional supervised data sets: reference translations are not the only correct outputs of a system and do not provide information about sentence segmentation or transfer components. Additionally, reference sentences cannot always be generated by a standard phrase-based system's translation table. We considered three update strategies that address these issues:

!IS_A_LIST Bold updating: Update toward the highest scoring way of generating the reference sentence.

!IS_A_LIST Local updating: Update toward the item on a system's n-best list with the highest BLEU score (or the best score under another error metric).

!IS_A_LIST Hybrid updating: Update with bold updating when the reference is reachable and local updating otherwise.

[Figure: Updating strategies]
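The three strategies differ only in which derivation the perceptron treats as its target. The following is a minimal sketch of that choice, not the implementation used in our system. It assumes derivations are represented as sparse feature dictionaries, that each n-best entry carries a precomputed quality score such as sentence-level BLEU, and that bold updating skips a sentence whose reference is unreachable. The function names (`choose_target`, `perceptron_update`) are hypothetical.

```python
from collections import Counter


def score(weights, feats):
    """Linear model score of a derivation's sparse feature vector."""
    return sum(weights[name] * value for name, value in feats.items())


def choose_target(weights, nbest, reference_derivations, strategy="hybrid"):
    """Pick the derivation to update toward.

    nbest: list of (features, quality) pairs, where quality is e.g. BLEU.
    reference_derivations: feature dicts for every known way of generating
        the reference; empty when the reference is unreachable.
    """
    reachable = len(reference_derivations) > 0
    if strategy == "bold" or (strategy == "hybrid" and reachable):
        if not reachable:
            return None  # assumption: bold updating skips this sentence
        # Highest-scoring way of generating the reference.
        return max(reference_derivations, key=lambda f: score(weights, f))
    # Local updating: best n-best item under the error metric.
    best_feats, _ = max(nbest, key=lambda item: item[1])
    return best_feats


def perceptron_update(weights, target_feats, guess_feats, lr=1.0):
    """Move weights toward the target and away from the model's guess."""
    for name, value in target_feats.items():
        weights[name] += lr * value
    for name, value in guess_feats.items():
        weights[name] -= lr * value
    return weights


# Toy usage: the reference is unreachable, so hybrid falls back to local.
weights = Counter()
nbest = [({"phrase:a": 1.0}, 0.2), ({"phrase:b": 1.0}, 0.7)]
guess = max(nbest, key=lambda item: score(weights, item[0]))[0]
target = choose_target(weights, nbest, [], strategy="hybrid")
if target is not None:
    perceptron_update(weights, target, guess)
```

Keeping target selection separate from the weight update means the same update rule serves all three strategies.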

We have also explored generative techniques for estimating translation parameters. We studied why learning a phrase-level conditional translation model underperforms a model estimated heuristically from relative frequency statistics. The generative process underlying our model echoes assumptions made for phrase-based decoding.

[Figure: Generative phrase-based model]

The type of overfitting exhibited by this model, where equally appropriate rules of different sizes compete for probability mass, appears in other NLP areas such as data-oriented parsing and remains an engaging area for future research.
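For comparison, the heuristic baseline estimates each conditional phrase probability as a relative frequency over extracted phrase pairs. The sketch below illustrates that computation only. It assumes phrase pairs have already been extracted from word-aligned data, and the conditioning direction (source phrase given target phrase) is an assumption made for illustration. The function name `relative_frequency_table` is hypothetical.

```python
from collections import Counter, defaultdict


def relative_frequency_table(phrase_pairs):
    """Estimate phi(source | target) = count(source, target) / count(target).

    phrase_pairs: iterable of (source_phrase, target_phrase) tuples,
        one per extracted phrase-pair occurrence.
    Returns a dict: table[target_phrase][source_phrase] -> probability.
    """
    pair_counts = Counter(phrase_pairs)

    # Marginal count of each target phrase across all its source pairings.
    target_counts = Counter()
    for (_, tgt), count in pair_counts.items():
        target_counts[tgt] += count

    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[tgt][src] = count / target_counts[tgt]
    return dict(table)


# Toy usage with hand-written phrase pairs (illustrative data only).
pairs = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("maison", "the house"),
    ("maison", "house"),
]
table = relative_frequency_table(pairs)
```

Because every extracted occurrence contributes a count, overlapping phrases of different sizes all keep some probability mass here, whereas a learned model is free to concentrate mass on a few of them.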

Synchronous Parsing

Parsing a bitext under a synchronous grammar is a task related to machine translation that underlies both parameter estimation and the learning of syntax-aware word alignments. However, its asymptotic complexity, a sixth-order polynomial in sentence length under many parsing formalisms, imposes severe computational challenges.

We have developed a general method for finding admissible A* heuristics in complex search domains that project onto multiple simpler domains. We applied this technique to synchronous parsing, projecting two monolingual PCFGs from a synchronous PCFG that together provide admissible estimates of the completion cost of a synchronous parse.

[Figure: Synchronous parsing heuristic]

In our NAACL paper (Haghighi et al., 2007), we also apply this technique to lexicalized parsing.
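The idea of combining projections into an admissible heuristic can be illustrated on a much simpler problem than synchronous parsing. The sketch below is a toy analogue, not our parser. It searches over pairs of nodes from two small graphs, where each joint step costs the sum of the two component costs plus a non-negative coupling penalty. Since the joint cost is never less than the sum of its projections, adding the exact completion costs in the two projected graphs never overestimates the true completion cost. The graph encoding, the `coupling` function, and all names are assumptions made for illustration.

```python
import heapq

INF = float("inf")


def dijkstra_to_goal(edges, goal):
    """Exact cost-to-goal for every node of one projected graph.

    edges: dict mapping node -> list of (next_node, cost).
    """
    reverse = {}
    for u, outs in edges.items():
        for v, cost in outs:
            reverse.setdefault(v, []).append((u, cost))
    dist = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, INF):
            continue  # stale queue entry
        for u, cost in reverse.get(v, []):
            nd = d + cost
            if nd < dist.get(u, INF):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def astar_joint(edges1, edges2, coupling, start, goal):
    """A* over joint states (u, v), guided by the sum of projected costs.

    coupling(u, v) must be >= 0 for the heuristic to be admissible.
    Returns (cost of the best joint path, number of states expanded).
    """
    h1 = dijkstra_to_goal(edges1, goal[0])
    h2 = dijkstra_to_goal(edges2, goal[1])

    def h(state):
        return h1.get(state[0], INF) + h2.get(state[1], INF)

    best = {start: 0.0}
    heap = [(h(start), 0.0, start)]
    expanded = 0
    while heap:
        _, g, state = heapq.heappop(heap)
        if g > best.get(state, INF):
            continue  # a cheaper path to this state was already found
        expanded += 1
        if state == goal:
            return g, expanded
        u, v = state
        for u2, c1 in edges1.get(u, []):
            for v2, c2 in edges2.get(v, []):
                ng = g + c1 + c2 + coupling(u2, v2)
                nxt = (u2, v2)
                if ng < best.get(nxt, INF):
                    best[nxt] = ng
                    heapq.heappush(heap, (ng + h(nxt), ng, nxt))
    return INF, expanded


# Toy usage: two three-node graphs and a penalty when the components differ.
edges1 = {0: [(1, 1.0)], 1: [(2, 1.0)], 2: []}
edges2 = {0: [(1, 2.0), (2, 5.0)], 1: [(2, 2.0)], 2: []}
cost, expanded = astar_joint(
    edges1, edges2, lambda u, v: 0.0 if u == v else 1.0, (0, 0), (2, 2)
)
```

Each projected problem is solved exactly once up front, which costs far less than searching the joint space. The joint search then uses those tables as a lookup heuristic.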

Syntactic Machine Translation

We believe that natural language syntax plays an important role in the translation process and will likewise prove a useful source of information for machine translation. While all successful machine translation systems encode aspects of syntax implicitly, we are considering approaches that explicitly generate English syntax trees during the translation process.

Presently, we are focusing on leveraging our advances in parsing and word alignment to improve translation quality. Our preliminary results are promising, and we look forward to presenting them to the community soon.

Our end-to-end system, currently under development, is modeled largely after the successful system of our primary collaborators at the Information Sciences Institute NLP Group. We hope to publicly release components of this system in the near future.

Acknowledgments

Support for machine translation research in the NLP group includes an NSF CAREER award (#IIS-0643742), a Microsoft New Faculty Fellowship, and a grant from the Hellman Family Faculty Fund.

Publications

!IS_A_LIST Sampling Alignment Structure under a Bayesian Translation Model, John DeNero, Alex Bouchard-Côté, and Dan Klein, In proceedings of EMNLP 2008. [pdf]

!IS_A_LIST Fully Distributed EM for Very Large Datasets, Jason Wolfe, Aria Haghighi, and Dan Klein, In proceedings of ICML 2008. [pdf] [slides]

!IS_A_LIST Learning Bilingual Lexicons from Monolingual Corpora, Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein, In proceedings of ACL 2008. [pdf] [slides]

!IS_A_LIST The Complexity of Phrase Alignment Models, John DeNero and Dan Klein, In proceedings of ACL Short Paper Track 2008. [pdf] [slides]

!IS_A_LIST A* Search via Approximate Factoring, Aria Haghighi, John DeNero, and Dan Klein, In proceedings of AAAI (Nectar Track) 2007. [pdf]

!IS_A_LIST Tailoring Word Alignments to Syntactic Machine Translation, John DeNero and Dan Klein, In proceedings of ACL 2007. [pdf] [slides]

!IS_A_LIST Approximate Factoring for A* Search, Aria Haghighi, John DeNero, and Dan Klein, In proceedings of HLT-NAACL 2007. [pdf] [slides] [bib]

!IS_A_LIST An End-to-End Discriminative Approach to Machine Translation, Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar, In proceedings of COLING-ACL 2006. [pdf] [slides] [bib]

!IS_A_LIST Why Generative Phrase Models Underperform Surface Heuristics, John DeNero, Dan Gillick, James Zhang, and Dan Klein, Workshop on Statistical Machine Translation at HLT-NAACL 2006. [pdf] [slides] [bib]

!IS_A_LIST Alignment by Agreement, Percy Liang, Ben Taskar, and Dan Klein, In proceedings of NAACL 2006. [pdf] [slides] [bib]

!IS_A_LIST Word Alignment Via Quadratic Assignment, Simon Lacoste-Julien, Ben Taskar, Dan Klein, and Michael Jordan, In proceedings of NAACL 2006. [pdf] [bib]

!IS_A_LIST A Discriminative Matching Approach to Word Alignment, Ben Taskar, Simon Lacoste-Julien, and Dan Klein, In proceedings of EMNLP 2005. [pdf] [bib]

Site designed by John DeNero