Researchers in Italy recently published the findings of a project called Codice Ratio, or “The Code System,” which demonstrates how deep learning is helping the Vatican transcribe a portion of their archives.
The Vatican Secret Archives are made up of some fifty-three miles of shelving, thirty-five thousand volumes of the catalog, and twelve centuries worth of documents. The records contain private letters and other records of past popes, including letters and documents from Michelangelo, Henry VIII requesting a marriage annulment, and even letters from Abraham Lincoln to the Pope.
Like many historic archives from the around the world, the Vatican has their own photographic and conservation team, however, the records are so large, that the idea of transcribing them by hand seems highly impractical.
“Our approach requires minimal training efforts, making the transcription process more scalable as the production of training sets requires a few pages and can be easily crowdsourced,” the researchers said. “Our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speedup the transcription process at a large scale.”
For training and inference, the researchers used an NVIDIA GeForce GTX GPU, CUDA and the cuDNN-accelerated TensorFlow deep learning framework. The team trained their neural network on images and documents provided to them by the Vatican, to recognize entire words rather than letters.
The researchers were very successful, generating the exact transcription for 65% of the word images of their dataset.
The team says they will continue to improve their training data set, obtaining more documents from the archives to improve their system.