The Berkeley NLP Group

Historical Linguistics

Languages evolve over time, with words changing in form, meaning, and the ways in which they can be combined into sentences. Several centuries of linguistic analysis have shed light on some of the key properties of this evolutionary process, but many open questions remain. The study of how languages change over time is known as diachronic (or historical) linguistics. Most of what we know about language change comes from the comparative method, in which words from different languages are compared in order to identify their relationships. The goal is to identify regular sound correspondences between languages, and use these to infer the forms of proto-languages and the phylogenetic relationships between languages. The motivation for basing the analysis on sounds is that phonological changes are generally more systematic than syntactic or morphological changes.

Comparisons of words from different languages are traditionally carried out by hand, introducing an element of subjectivity into diachronic linguistics. Early attempts to quantify the similarity between languages made drastic simplifying assumptions that drew strong criticism from diachronic linguists. In particular, many of these approaches simply represent the appearance of a word in two languages with a single bit, rather than allowing for gradations based on correspondences between sequences of phonemes.

We take a quantitative approach to diachronic linguistics that alleviates this problem by operating at the phoneme level. Our approach combines the advantages of the classical, phoneme-based, comparative method with the robustness of corpus-based probabilistic models. We estimate a contextualized model of phonological change expressed as a probability distribution over rules applied to individual phonemes:

The model is fully generative, and thus can be used to solve a variety of problems. For example, we can reconstruct ancestral word forms or inspect the rules learned along each branch of a phylogeny to identify sound laws. Alternatively, we can observe a word in one or more modern languages, say French and Spanish, and query the corresponding word form in another language, say Italian. Finally, models of this kind can potentially be used as a building block in a system for inferring the topology of phylogenetic trees. See below for an example of reconstruction of phonological rules (see paper for details):

Corpus, version 1.0

In order to train and evaluate our system, we compiled a corpus of Romance cognate words. The raw data was taken from three sources: the wiktionary.org website, a Bible parallel corpus (Resnik et al., 1999) and the Europarl corpus (Koehn, 2002). From an XML dump of the Wiktionary data, we extracted multilingual translations, which provide a list of word tuples in a large number of languages, including a few ancient languages. The Europarl and the biblical data were processed and aligned in the standard way, using combined GIZA++ alignments (Och and Ney, 2003). We performed our experiments with four languages from the Romance family (Latin, Italian, Spanish, and Portuguese). For each of these languages, we used a simple in-house rule-based system to convert the words into their IPA representations. After augmenting our alignments with the transitive closure of the Europarl, Bible and Wiktionary data, we filtered out non-cognate words by thresholding the ratio of edit distance to word length. The preprocessing is constraining in that we require all the elements of a tuple to be cognates, which leaves a significant portion of the data behind. However, our approach relies on this assumption, as there is no explicit model of non-cognate words. An interesting direction for future work is the joint modeling of phonology with the determination of the cognates.
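The filtering step can be illustrated with a minimal sketch. It is not the in-house tool: the normalization by the longer word's length, the pairwise check over every pair in a tuple, the threshold value of 0.5, and the function names are all assumptions made for illustration, since the page only states that the ratio of edit distance to word length is thresholded.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_cognate_tuple(forms, threshold=0.5):
    """Keep a tuple only if every pair of forms is within the threshold.

    The threshold is a hypothetical value, not the one used for the corpus.
    """
    return all(normalized_distance(a, b) <= threshold
               for a, b in combinations(forms, 2))


# Illustrative strings only; these are not entries from the corpus.
candidates = [("nokte", "notte", "notʃe"), ("nokte", "kane")]
kept = [forms for forms in candidates if is_cognate_tuple(forms)]
```

Requiring every pair to pass mirrors the constraint that all elements of a tuple must be cognates, which is also why a single divergent form causes the whole tuple to be discarded.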

Resources

Make sure you read these files using the UTF-8 encoding. The tool used to apply the rules and to perform the cognate detection and transitive closure is available upon request (bouchard [at] cs [dot] berkeley) and should be released with documentation soon.

Processed cognates: the processed set of cognates used for our EMNLP paper.
Raw data: the raw data used to create the above cognates.
Rules: the rules used to convert to IPA.
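As a small sketch of the UTF-8 advice above, the following Python helper opens a text file with an explicit encoding instead of relying on the platform default, which matters because IPA symbols fall outside ASCII. The helper name and the file name in the usage note are hypothetical, and the code assumes nothing about the layout of the resource files beyond their being line-oriented text.

```python
def read_utf8_lines(path):
    """Return the non-empty lines of a text file, decoded explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

For example, `read_utf8_lines("cognates.txt")` would return the lines of a downloaded file saved under that (assumed) name, with IPA characters intact.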