A language can offer meaningful clues into how a culture views the world and its place within it, representing a lived body of knowledge. Every culture has something to say, so understandably, it’s a collective tragedy for the whole of humanity when a language goes extinct, and we lose a part of the beautiful, metaphorical tapestry that is the human experience.
But what if there were a way to automatically recover these lost languages? Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have found a way to use machine learning to help decode dead, “undeciphered” languages, which could eventually help us understand the grammar, vocabulary, and syntax underlying the written versions of these lost languages. In particular, the research team focused on texts that were written with few or no spaces between the words, a practice known as scriptio continua.
“Our work is about automatic decipherment of lost languages written in an under-segmented or unsegmented script; apparently for some ancient languages, word dividers had not been invented, or not consistently applied,” said Jiaming Luo, co-author of the study. “The significance of our work lies in the fact that ours is the first attempt to do such decipherment automatically using machine learning in such challenging situations.”
Finding Linguistic Cousins
Typically, in order to crack the code of an unknown language, it helps to know at least one other language that is related to it. For instance, years ago experts were able to decipher Gothic, an extinct East Germanic language, thanks to its relatedness to known languages like Proto-Germanic, Old Norse and Old English. Inspired by this concept, the team developed their decipherment algorithm along similar lines; an earlier version of it was introduced last year in a previous paper.
“Our machine learning model works by trying to match as many word pairs as possible, between the ancient language and some known one, while handling the uncertainty in segmentation,” explained Luo. “What exactly counts as a matched pair depends on their sound correspondences on the character level, and how regular these correspondences are. For instance, if you find many pairs with a consistent change like p to b, then you are fairly confident that these pairs are truly matched. Why does this work? Because historical linguistics tells us that language changes happen in regular and consistent ways. If two languages are truly related (for example, as Spanish and Italian are), then you would see these patterns emerge over and over again.”
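To make Luo’s point about regular correspondences concrete, here is a minimal Python sketch. It is not the team’s model: the word pairs are invented toy data, the function names `correspondence_counts` and `regularity` are hypothetical, and the sketch assumes that matched words have the same length so characters can be compared position by position.

```python
from collections import Counter

def correspondence_counts(pairs):
    """Count position-by-position character pairs across matched words."""
    counts = Counter()
    for lost, known in pairs:
        if len(lost) != len(known):
            continue  # toy assumption: skip pairs needing insertions or deletions
        counts.update(zip(lost, known))
    return counts

def regularity(pair, counts):
    """Average share of each lost-side character's alignments that agree with this pair."""
    totals = Counter()
    for (lost_char, _), n in counts.items():
        totals[lost_char] += n
    lost, known = pair
    scores = []
    for a, b in zip(lost, known):
        # A character never seen before contributes a score of 0.
        scores.append(counts[(a, b)] / totals[a] if totals[a] else 0.0)
    return sum(scores) / len(scores)

# Invented toy data: every pair shows the same p -> b change.
matched = [("pata", "bata"), ("pino", "bino"), ("lupa", "luba")]
counts = correspondence_counts(matched)

consistent = regularity(("pato", "bato"), counts)    # follows the p -> b pattern
inconsistent = regularity(("pato", "kato"), counts)  # breaks it with p -> k
print(consistent, inconsistent)
```

A candidate pair that follows the recurring p to b change scores higher than one that breaks it, which is the intuition behind treating consistent sound changes as evidence of a true match.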
In addition to incorporating these linguistic tendencies, the model handles the uncertainties that come with unsegmented text by “embedding” the language sounds into an imaginary multidimensional space, where the variations in pronunciation are represented as distances between points in this space. With this kind of framework, the model is able to detect patterns in the evolution of related languages, allowing it to separate out the words in undeciphered languages and map them to words in known, related languages.
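The idea of sounds as points in a space, combined with segmentation, can be sketched roughly as follows. This is an illustration under loud assumptions, not the paper’s method: the `FEATURES` table is a hand-made, hypothetical three-number encoding (voicing, place, manner) rather than a learned embedding, the vocabulary and the unsegmented text are invented, and a simple dynamic program stands in for the model’s actual training objective.

```python
import math

# Hypothetical hand-made coordinates: (voicing, place, manner).
FEATURES = {
    "p": (0, 0, 0), "b": (1, 0, 0),
    "t": (0, 1, 0), "d": (1, 1, 0),
    "k": (0, 2, 0), "g": (1, 2, 0),
    "a": (1, 3, 3), "o": (1, 4, 3),
}

def sound_distance(x, y):
    """Euclidean distance between two sounds in the toy feature space."""
    return math.dist(FEATURES[x], FEATURES[y])

def word_distance(span, word):
    """Sum of sound distances; spans of a different length cannot match."""
    if len(span) != len(word):
        return math.inf
    return sum(sound_distance(a, b) for a, b in zip(span, word))

def segment(text, vocab):
    """Split unsegmented text into spans, each mapped to its closest known word.

    best[i] holds the cheapest (cost, matches) found for text[:i].
    """
    best = [(0.0, [])] + [(math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for word in vocab:
            start = end - len(word)
            if start < 0 or best[start][0] == math.inf:
                continue
            span = text[start:end]
            cost = best[start][0] + word_distance(span, word)
            if cost < best[end][0]:
                best[end] = (cost, best[start][1] + [(span, word)])
    return best[-1]

# Invented example: an unsegmented "lost" text and a small "known" vocabulary.
cost, matches = segment("patako", ["bada", "go", "dak"])
print(cost, matches)
```

Because p and b (or k and g) differ only in voicing, they sit close together in this toy space, so the cheapest segmentation splits the text into "pata" and "ko" and maps them to "bada" and "go".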
As outlined in the team’s paper, this relatedness between known, deciphered languages and undeciphered languages can be used as a kind of baseline, a “ground truth” to help determine whether such AI-powered decipherment models are actually working. In this study, the team first tested the model on two languages that have already been deciphered and whose relatives are known: Gothic, and Ugaritic, a Semitic language somewhat similar to ancient Hebrew. They then applied it to a language that remains undeciphered, Iberian. The results corroborate the view that Iberian is very likely not related to Basque, nor to other proposed candidates such as Germanic, Turkic, and Uralic languages, a conclusion that is supported by other recent findings.
While the model appears to work well in evaluating how related two languages might be, the team is now aiming to expand it beyond its current capabilities so that it can juggle multiple, potentially unrelated languages. For now, the team hopes that the model can help automate, and take some of the guesswork out of, what is usually a long, tedious process.
“Our work could be useful for linguists to get a quick analysis of the relationship between two languages, especially when one of them is unknown,” said Luo. “It is by no means as adequate or thorough as human analysis, but it’s much much quicker and requires much much less human effort.”
Read more in the team’s paper.