11th-century glyphs classified through AI research project

Artificial intelligence can detect faces, grocery items, and even potentially poisonous mushrooms. So why not historical graffiti?

In a paper published on the preprint server Arxiv.org ("Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti"), researchers at the National Technical University of Ukraine and Huizhou University's School of Information Science and Technology describe a machine learning model that detects, isolates, and classifies ancient letters carved on the stone walls of a Kiev cathedral.

"[C]arved handwriting has usually much worse quality and shabby state to provide the similar values of accuracy ... Usually, the preprocessing requires a priori knowledge about the entire glyph, but [certain] datasets are not available at the moment as open source databases ..." the team wrote. "The main aim of this paper is to apply some machine learning techniques for automatic recognition of ... historical graffiti ... and estimate their efficiency in the view of the complex geometry, barely discernible shape, and low statistical representativeness."

The researchers focused the bulk of their efforts on Glagolitic and Cyrillic, two alphabets commonly used in Eastern Slavic visual texts. Archeologists found glyphs from both -- some dating back to the 11th century -- on the St. Sophia Cathedral in Ukraine. To date, about 7,000 have been detected and studied.

It goes without saying that historical letter datasets aren't as common as, say, those for the Arabic alphabet, so the team assembled and preprocessed a collection of more than 4,000 images for 34 letter types. They used notMINST, a second database containing publicly available fonts and glyphs for the letters A-J, to compare the two outputs.

They next embarked on training a convolutional neural network -- a type of machine learning algorithm that's commonly used in computer vision -- to recognize the graffiti by feeding it data from the notMINST and their novel dataset, taking care to horizontally and vertically flip some of the original images so as to prevent overfitting.

The neural net was 99 percent accurate in isolating characters from the team's dataset and notMINST, respectively.

In future, the researchers hope to improve the model by "teaching" it to consider factors like date, language, authorship, authenticity, and meaning. Furthermore, they propose the creation of larger datasets shared "around the world" in the spirit of "open science, volunteer data collection, processing, and computing," which they say will lead to further advancements.

"[G]raffiti are very powerful source of historical knowledge. [F]or example, the only known source of the Safaitic language is graffiti inscriptions on the surface of rocks in southern Syria, eastern Jordan and northern Saudi Arabia," they wrote. "The recent progress of computer vision and machine learning methods allows applying some of them to improve the current recognition, identification, localization, semantic segmentation, and interpretation of such historical graffiti of various origin ..."

More