Advances in Artificial Intelligence
Volume 2012, Article ID 484580, 15 pages
http://dx.doi.org/10.1155/2012/484580
Research Article

Learning to Translate: A Statistical and Computational Analysis

1 European Commission-Joint Research Centre (JRC), IPSC, GlobeSec, Via Fermi 2749, 21020 Ispra, Italy
2 Intelligent Systems Laboratory, University of Bristol, MVB, Woodland Road, Bristol BS8 1UB, UK
3 Interactive Language Technologies, National Research Council Canada, 283 Boulevard Alexandre-Taché, Gatineau, QC, Canada J8X 3X7

Received 15 July 2011; Accepted 30 January 2012

Academic Editor: Peter Tino

Copyright © 2012 Marco Turchi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We present an extensive experimental study of phrase-based statistical machine translation from the point of view of its learning capabilities. Using high-performance computing, we obtain very accurate learning curves and provide extrapolations of the projected performance of the system under different conditions. Our experiments confirm existing, mostly unpublished, beliefs about the learning capabilities of statistical machine translation systems. We also provide insight into the way statistical machine translation learns from data, including the respective influence of the translation and language models, the impact of phrase length on performance, and various unlearning and perturbation analyses. Our results support and illustrate the fact that performance improves by a constant amount for each doubling of the data, across different language pairs and different systems. This fundamental limitation seems to be a direct consequence of Zipf's law governing textual data. Although the rate of improvement may depend on both the data and the estimation method, the general shape of the learning curve is unlikely to change without major changes in the modeling and inference phases. Possible research directions that address this issue include the integration of linguistic rules or the development of active learning procedures.
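A constant gain per doubling of the data corresponds to a learning curve that is linear in the logarithm of the training set size, that is, score ≈ a + b · log2(N), where b is the gain per doubling. The sketch below is only an illustration of that relationship, not the procedure used in the study: the function names (`fit_log_linear`, `predict`), the constants `TRUE_A` and `TRUE_B`, and the training sizes are all hypothetical, and the scores are generated synthetically from the assumed formula rather than taken from any experiment.

```python
import math


def fit_log_linear(sizes, scores):
    """Ordinary least-squares fit of score = a + b * log2(size).

    Returns (a, b); b is the estimated gain per doubling of the data.
    """
    xs = [math.log2(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, size):
    """Extrapolate the fitted curve to a given training set size."""
    return a + b * math.log2(size)


# Synthetic, illustrative values only (assumptions, not results from the study).
TRUE_A, TRUE_B = 5.0, 1.5
sizes = [10_000 * 2 ** k for k in range(6)]  # each size doubles the previous one
scores = [TRUE_A + TRUE_B * math.log2(n) for n in sizes]

a, b = fit_log_linear(sizes, scores)

# Doubling the largest training set changes the predicted score by b.
gain_per_doubling = predict(a, b, 2 * sizes[-1]) - predict(a, b, sizes[-1])
```

Under this assumed model, the extrapolated gain from doubling the data is the same at every scale, so each further fixed improvement requires twice as much data as the previous one.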