MIT CSAIL's AI revives dead languages it hasn't seen before

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) claim to have developed a system that can decipher a lost language without knowing its relation to other languages. The team says this is a step toward a system that's able to decipher lost languages using just a few thousand words.

Lost languages are more than an academic curiosity. Without them, we risk losing a body of knowledge about the people who historically spoke them. Unfortunately, most lost languages left such minimal records that scientists can't decipher the languages using conventional machine-translation algorithms. Some languages don't have a well-researched "relative" language to compare them to, and many do not use traditional dividers like white space and punctuation.

This CSAIL work, which was supported in part by the Intelligence Advanced Research Projects Activity and spearheaded by MIT professor and natural language processing specialist Regina Barzilay, leverages several principles grounded in insights from historical linguistics. For instance, while a given language rarely adds or deletes a sound, certain sound substitutions are likely to occur. A word with a "p" in the parent language may change into a "b" in the descendant language, but changing to a "k" is less likely due to the significant pronunciation gap.

By incorporating these and other linguistic constraints, Barzilay and coauthor Jiaming Luo developed a decipherment algorithm that can handle both the vast space of possible transformations and the scarcity of guiding signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors. This design enables the system to capture patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.
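To make the idea of distance between sound vectors concrete, here is a minimal Python sketch. It is not the researchers' model: their system learns its embeddings, whereas the feature values, the `FEATURES` table, the `INDEL_COST` constant, and the function names below are hand-picked assumptions for illustration. The sketch places a few consonants in a small feature space so that "p" sits closer to "b" than to "k", then uses those distances as substitution costs in a weighted edit distance that also penalizes adding or deleting a sound.

```python
import math

# Hypothetical, hand-picked features (NOT the learned embeddings from the
# CSAIL work): (place of articulation, voicing, manner).
# place:   0 = lips, 1 = ridge behind the teeth, 2 = back of the mouth
# voicing: 0 = voiceless, 0.5 = voiced (a deliberately small step)
# manner:  0 = stop, 1 = fricative
FEATURES = {
    "p": (0.0, 0.0, 0.0),
    "b": (0.0, 0.5, 0.0),
    "t": (1.0, 0.0, 0.0),
    "d": (1.0, 0.5, 0.0),
    "k": (2.0, 0.0, 0.0),
    "g": (2.0, 0.5, 0.0),
    "f": (0.0, 0.0, 1.0),
    "s": (1.0, 0.0, 1.0),
}

# Assumed penalty for adding or deleting a sound; set above the maximum
# substitution cost because whole-sound insertions and deletions are rare.
INDEL_COST = 1.5


def sound_distance(a, b):
    """Euclidean distance between two sounds in the toy feature space."""
    return math.dist(FEATURES[a], FEATURES[b])


def substitution_cost(a, b):
    """Cost in [0, 1] of replacing sound a with sound b."""
    if a == b:
        return 0.0
    if a in FEATURES and b in FEATURES:
        return min(1.0, sound_distance(a, b) / 2.0)
    return 1.0  # sounds without features (e.g. vowels here) get the maximum cost


def weighted_edit_distance(src, tgt):
    """Edit distance whose substitution cost depends on sound similarity."""
    prev = [j * INDEL_COST for j in range(len(tgt) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [i * INDEL_COST]
        for j, t in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + INDEL_COST,                    # delete s
                curr[j - 1] + INDEL_COST,                # insert t
                prev[j - 1] + substitution_cost(s, t),   # substitute s with t
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # "pata", "bata", and "kata" are invented strings, not real words.
    print(weighted_edit_distance("pata", "bata"))  # 0.25
    print(weighted_edit_distance("pata", "kata"))  # 1.0
```

Under these assumed features, changing "p" to "b" costs a quarter of what changing "p" to "k" costs, which mirrors the intuition that small pronunciation shifts are more plausible than large ones. A learned embedding would replace the fixed `FEATURES` table with vectors fitted to data.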

With the new system, the relationship between languages is inferred by the algorithm, which can assess the proximity between two languages. Moreover, when tested on known languages, it can accurately identify language families.

The team applied their algorithm to Iberian and considered relationships to Basque, as well as less likely candidates from Romance, Germanic, Turkic, and Uralic families. While Basque and Latin were closer to Iberian than other languages, they were still too different to be considered related, the system revealed.

In future work, the researchers hope to expand their efforts beyond connecting texts to related words in a known language, an approach referred to as cognate-based decipherment. The new direction would involve identifying the semantic meaning of words even if the system doesn't know how to read them. "These methods of 'entity recognition' are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language," Barzilay said.

Barzilay and coauthors aren't the only ones applying AI to the field of lost languages. Alphabet's DeepMind developed a system called Pythia that learned to recognize patterns in 35,000 relics containing more than 3 million words. It managed to guess missing words or characters from Greek inscriptions on surfaces including stone, ceramic, and metal that were between 1,500 and 2,600 years old.