Homophonic Substitution Analyzer in CT2 - Analyzing the infamous Zodiac-408 letter.

Source publication
Article
Many historians and linguists are working individually and in an uncoordinated fashion on the identification and decryption of historical ciphers. This is a time-consuming process as they often work without access to automatic methods and processes that can accelerate the decipherment. At the same time, computer scientists and cryptologists are dev...

Contexts in source publication

Context 1
... implemented cryptanalysis algorithms based on hill climbing and simulated annealing, allowing the user to break homophonic substitution ciphers in CT2 (Kopal 2019). Figure 3 shows the Homophonic Substitution Analyzer component in CT2 analyzing the infamous Zodiac-408 letter. The tool allows the user to automatically and semiautomatically analyze homophonically encrypted texts. ...
Context 2
... revealed words can be marked and "locked" for further analysis steps. In Figure 3, locked letters are marked green and automatically found words are marked blue. In future work, we plan to adapt the analyzer in such a way that it is able to analyze homophones of different lengths, e.g. ...
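The two contexts above mention hill climbing and the ability to "lock" revealed letters. The following is a minimal Python sketch of that idea, not the CT2 implementation: the toy bigram scorer, the function names, and the `locked` parameter are all assumptions made for illustration, and plain hill climbing stands in for the combination with simulated annealing described in the source.

```python
import random
from collections import Counter

def bigram_model(reference):
    """Count letter bigrams in a reference text (toy stand-in for a real language model)."""
    return Counter(reference[i:i + 2] for i in range(len(reference) - 1))

def score(plaintext, model):
    """Sum the reference bigram counts over a candidate plaintext; higher is better."""
    return sum(model[plaintext[i:i + 2]] for i in range(len(plaintext) - 1))

def decrypt(cipher, key):
    """Map each ciphertext symbol to its currently assigned plaintext letter."""
    return "".join(key[symbol] for symbol in cipher)

def hill_climb(cipher, alphabet, model, locked=None, steps=2000, seed=0):
    """Reassign one unlocked symbol at a time, keeping changes that do not lower the score.

    `locked` (hypothetical parameter) maps symbols to letters that must never change,
    mirroring the idea of locking revealed letters before further analysis.
    """
    rng = random.Random(seed)
    locked = dict(locked or {})
    symbols = sorted(set(cipher))
    # Several symbols may share one letter: that is what makes the cipher homophonic.
    key = {s: locked.get(s, rng.choice(alphabet)) for s in symbols}
    free = [s for s in symbols if s not in locked]
    start = best = score(decrypt(cipher, key), model)
    for _ in range(steps):
        if not free:
            break
        symbol = rng.choice(free)
        previous = key[symbol]
        key[symbol] = rng.choice(alphabet)
        candidate = score(decrypt(cipher, key), model)
        if candidate >= best:
            best = candidate          # keep improvements and sideways moves
        else:
            key[symbol] = previous    # undo a worsening move
    return key, start, best

model = bigram_model("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOGTHENTHERE")
cipher = [1, 2, 3, 4, 1, 2, 5, 6, 1, 7, 3, 8, 1, 2, 3]
key, start, best = hill_climb(cipher, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", model, locked={1: "T"})
```

A real analyzer would score candidates with higher-order n-gram statistics from a large corpus, and a simulated annealing variant would additionally accept some worsening moves with a probability that decreases over time.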

Citations

... The DECRYPT project: aims and tools. The long-term goal of the DECRYPT project is to establish a new cross-disciplinary subject of historical cryptology to shed light on the usage, content, and development of historical ciphers throughout the centuries in Europe. In order to do so, the project aims at building a research infrastructure for historical cryptology to collect, digitize, process, and decrypt historical encrypted sources and release these through a web service with information about their provenance and other facts of relevance (Megyesi et al., 2020). To achieve this rather ambitious goal, resources in the form of ciphers and keys, along with historical non-encrypted sources, are collected and processed for analysis, and tools for transcription and decryption are developed. ...
... Historians and linguists benefit from the decoded documents, leading to new knowledge and better understanding of our history and historical languages; computer scientists, cryptologists, and computational linguists working on developing methods for automatic decryption of various types of ciphers get access to a heterogeneous collection of ciphertexts and code material from linguists and historians, which in turn can lead to new methodological insights in language technology applications. Librarians and archivists get a correct identification and description of the encrypted documents that lie hidden in the collections they are guarding (Megyesi et al., 2020). ...
Article
A widely shared recognition over the past decade is that the methodology and the basic concepts of science and technology studies (STS) can be used to analyze collaborations in the cross-disciplinary field of digital humanities (DH). The concepts of trading zones (Galison, 2010), boundary objects (Star and Griesemer, 1989), and interactional expertise (Collins and Evans, 2007) are particularly fruitful for describing projects in which researchers from massively different epistemic cultures (Knorr Cetina, 1999) are trying to develop a common language. The literature, however, primarily concentrates on examples where only two parties, historians and IT experts, work together. More exciting perspectives open up for analysis when more than two, more nuanced and different epistemic cultures seek a common language and common research goals. In the DECRYPT project funded by the Swedish Research Council, computational linguists, historians, computer scientists and AI experts, cryptologists, computer vision specialists, historical linguists, archivists, and philologists collaborate with strikingly different methodologies, publication patterns, and approaches. They develop and use common resources (including a database and a large collection of European historical texts) and tools (among others, code-breaking software and a handwritten text recognition tool for transcription), researching partly overlapping topics (handwritten historical ciphers and keys) to reach common goals. In this article, we aim to show how the STS concepts are illuminating when describing the mechanisms of the DECRYPT collaboration and shed some light on the best practices and challenges of a truly cross-disciplinary DH project.
... For this reason, we bring to the reader's attention two interesting projects that focus on such collections. The DECRYPT project (Megyesi et al., 2020; Megyesi et al., 2022) contains 4360 records at the time of writing this paper. Perhaps the only disadvantage of this database is that most of the documents are not publicly available. ...
Conference Paper
In this article, we present encrypted documents and cipher keys from the 18th and 19th centuries, related to the Central European aristocratic families Amade-Üchtritz, Esterházy, and Pálffy-Daun. In the first part of the article, we present an overview and analysis of the available documents from the archives with examples. We provide a short historical overview of the people related to the analyzed documents to provide a context for the research. In the second part of the article, we focus on the digital processing of these historical manuscripts. We developed new tools based on machine learning that can automate the transcription of encrypted parts of the documents, which contain only digits as cipher text alphabet. Our digit detection and segmentation are based on YOLOv7. YOLOv7 provided good detection precision and was able to cope with problems like noisy paper background and areas where digits collided with the text from the reverse side of the paper.
... DECRYPT (https://de-crypt.org/) [10] and HCPortal (https://hcportal.eu) [4,5]. ...
Article
This paper deals with historical encrypted manuscripts and introduces an automated method for the detection and transcription of ciphertext symbols for subsequent cryptanalysis. Our database contains documents used in the past by aristocratic families living in the territory of Slovakia. They are encrypted using a nomenclator which is a specific type of substitution cipher. In our case, the nomenclator uses digits as ciphertext symbols. We have proposed a method for the detection, classification, and transcription of handwritten digits from the original documents. Our method is based on Mask R-CNN which is a deep convolutional neural network for instance segmentation. Mask R-CNN was trained on a manually collected database of digit annotations. We employ a specific strategy where the input image is first divided into small blocks. The image blocks are then passed to Mask R-CNN to obtain detections. This way we avoid problems related to the detection of a large number of small dense objects in a high-resolution image. Experiments have shown promising detection performance for all digit types with minimum false detections.
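The block-wise strategy described in this abstract can be sketched in Python as follows. This is only an illustration of dividing a page into tiles and mapping tile-local detections back to page coordinates; the function names, the tile size, and the optional `overlap` parameter are assumptions, and the detector itself (Mask R-CNN in the paper) is not called here.

```python
def _starts(length, tile, stride):
    """Start offsets along one axis so that tiles of size `tile` cover `length`."""
    positions = [0]
    while positions[-1] + tile < length:
        positions.append(positions[-1] + stride)
    return positions

def tile_boxes(width, height, tile, overlap=0):
    """Return (left, top, right, bottom) boxes that together cover the whole image.

    `overlap` is a hypothetical extra: neighbouring tiles share that many pixels
    so that a digit cut by one tile border is fully visible in the next tile.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("tile must be positive and overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    boxes = []
    for top in _starts(height, tile, stride):
        for left in _starts(width, tile, stride):
            # Tiles on the right and bottom edges are clipped to the image size.
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

def to_image_coords(box, detection):
    """Shift a detection (x0, y0, x1, y1) from tile-local to full-image coordinates."""
    left, top, _, _ = box
    x0, y0, x1, y1 = detection
    return (x0 + left, y0 + top, x1 + left, y1 + top)

boxes = tile_boxes(100, 50, 50)
shifted = to_image_coords(boxes[1], (1, 2, 3, 4))
```

In a full pipeline each box would be cropped from the page, passed to the detector, and the resulting detections merged after shifting them with `to_image_coords`; with overlapping tiles, duplicate detections would also need to be suppressed.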
... Example documents include books from secret societies, diplomatic correspondences, and pharmacological books. Previous work has been done on collecting historical ciphers from libraries and archives and making them available for researchers (Pettersson and Megyesi, 2019; Megyesi et al., 2020). However, decipherment of classical ciphers is an essential step to reveal the contents of those historical documents. ...
... However, these methods all assume that cipher elements are clearly segmented (i.e., that token boundaries are well established). Many historical documents, however, are enciphered as continuous sequences of digits that hide token boundaries (Lasry et al., 2020). An example cipher (the IA cipher) is shown in Figure 1 (Megyesi et al., 2020). ...
... Many historical documents, however, are enciphered as continuous sequences of digits that hide token boundaries (Lasry et al., 2020). An example cipher (the IA cipher) is shown in Figure 1 (Megyesi et al., 2020). Solving those ciphers is very challenging since it is not possible to directly search for the key without finding substitution units. ...
Preprint
Deciphering historical substitution ciphers is a challenging problem. Example problems that have been previously studied include detecting cipher type, detecting plaintext language, and acquiring the substitution key for segmented ciphers. However, attacking unsegmented, space-free ciphers is still a challenging task. Segmentation (i.e. finding substitution units) is the first step towards cracking those ciphers. In this work, we propose the first automatic methods to segment those ciphers using Byte Pair Encoding (BPE) and unigram language models. Our methods achieve an average segmentation error of 2% on 100 randomly generated monoalphabetic ciphers and 27% on 3 real homophonic ciphers. We also propose a method for solving non-deterministic ciphers with existing keys using a lattice and a pretrained language model. Our method leads to the full solution of the IA cipher, a real historical cipher that had not been fully solved until this work.
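To illustrate how Byte Pair Encoding can propose substitution units in an unsegmented digit stream, here is a minimal Python sketch of generic BPE merging. It is not the paper's exact procedure: the function name, the stopping rule, and the number of merges are assumptions chosen for illustration.

```python
from collections import Counter

def bpe_segment(text, num_merges):
    """Segment `text` by repeatedly merging its most frequent adjacent token pair.

    Starts from single characters (e.g. digits) and stops early once no pair
    occurs at least twice, since merging a one-off pair adds no evidence.
    """
    tokens = list(text)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (first, second), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merged = []
        i = 0
        while i < len(tokens):
            # Merge left to right so that overlapping matches are not merged twice.
            if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

units = bpe_segment("121212", 1)
```

Here the pair ("1", "2") occurs three times and is merged first, so the stream is segmented into three two-digit units. On real ciphers the number of merges controls how long the proposed units become.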
... This paper compares FE and FL approaches applied to the problem of classification of (a large synthetic set of) classical ciphers, which were historically relevant up to World War II. Current research focuses on the automated digitization, analysis, and decryption of encrypted historical documents [15]. Ciphers (or encryption algorithms) are a method to protect information (plaintext) by converting it to ciphertext. ...
... Ciphertexts with these lengths are rare. Based on the DECODE database, described in [14,15], the majority of encrypted historic manuscripts are only between a few lines and a few pages of ciphertext long. As of the end of May 2021, the DECODE database contains more than 2,600 records of encrypted historic manuscripts and keys. ...
Chapter
To break a ciphertext, as a first step, it is essential to identify the cipher used to produce the ciphertext. Cryptanalysis has acquired deep knowledge on cryptographic weaknesses of classical ciphers, and modern ciphers have been designed to circumvent these weaknesses. The American Cryptogram Association (ACA) standardized so-called classical ciphers, which had historical relevance up to World War II. Identifying these cipher types using machine learning has shown promising results, but the state of the art relies on engineered features based on cryptanalysis. To overcome this dependency on domain knowledge, we explore in this paper the applicability of the two feature-learning algorithms long short-term memory (LSTM) and Transformer, for 55 classical cipher types from ACA. To lower the necessary data and the training time, various transfer-learning scenarios are investigated. Over a dataset of 10 million ciphertexts with a text length of 100 characters, Transformer correctly identified 72.33% of the ciphers, which is a slightly worse result than the best feature-engineering approach. Furthermore, with an ensemble model of feature-engineering and feature-learning neural network types, 82.78% accuracy over the same dataset has been achieved, which is the best known result for this significant problem in the field of cryptanalysis.
... Recognizing and extracting information from these documents is important to understand our cultural heritage, since it helps to shed new light on and (re-)interpret our history [2]. However, manual transcription is infeasible due to the number of manuscripts, and automatic recognition is difficult due to the very limited availability of annotated data for training. ...
Preprint
Handwritten text recognition in low-resource scenarios, such as manuscripts with rare alphabets, is a challenging problem. The main difficulty comes from the scarcity of annotated data and the limited linguistic information (e.g. dictionaries and language models). Thus, we propose a few-shot learning-based handwriting recognition approach that significantly reduces the human annotation effort, requiring only a few images of each alphabet symbol. First, our model detects all symbols of a given alphabet in a textline image, then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. Our model is first pretrained on synthetic line images generated from any alphabet, even one different from the target domain. A second training step is then applied to diminish the gap between the source and target data. Since this retraining would require annotation of thousands of handwritten symbols together with their bounding boxes, we propose to avoid such human effort through an unsupervised progressive learning approach that automatically assigns pseudo-labels to the non-annotated data. The evaluation on different manuscript datasets shows that our model can lead to competitive results with a significant reduction in human effort.
... Nonetheless, to reach this end, a great deal of research remains to be done, and promising results are progressively emerging in projects like those promoted by the Institut de recherche et d'histoire des textes (France), the IRIS program Scripta-PSL History and practices of writing (Université Paris) or the Centre for the Study of Manuscript Cultures of the University of Hamburg (Germany). Moreover, a higher degree of difficulty in working with special manuscripts arises when automatic (or, more precisely, semi-automatic) tools are used to transcribe encrypted historical texts, such as those developed in the DECRYPT project (Megyesi et al., 2020). ...
... Example documents include encrypted letters, diplomatic correspondences, and books from secret societies (Figure 1). Previous work has made historical cipher collections available for researchers (Pettersson and Megyesi, 2019; Megyesi et al., 2020). Decipherment of classical ciphers is an essential step to reveal the contents of those historical documents. ...
... Libraries and archives have many enciphered documents from the early modern period. Previous work has been done to make historical cipher collections available for researchers (Megyesi et al., 2020; Pettersson and Megyesi, 2019). Decipherment of classical ciphers is an essential step to reveal the contents of those historical documents. ...
Preprint
Decipherment of historical ciphers is a challenging problem. The language of the target plaintext might be unknown, and ciphertext can have a lot of noise. State-of-the-art decipherment methods use beam search and a neural language model to score candidate plaintext hypotheses for a given cipher, assuming plaintext language is known. We propose an end-to-end multilingual model for solving simple substitution ciphers. We test our model on synthetic and real historical ciphers and show that our proposed method can decipher text without explicit language identification and can still be robust to noise.
... Given the difficulties in the decryption of such manuscripts, some multidisciplinary initiatives [1] have emerged to join the expertise in computer vision, computational linguistics, philology, cryptanalysis and history to make advances in historical cryptology. These joint efforts aim to ease the collection, transcription, decryption and contextualization of historical ciphered manuscripts in order to unlock their contents and make the secret information available for scholars in history, science, religion, etc. ...
Preprint
Encoded (or ciphered) manuscripts are a special type of historical document that contains encrypted text. The automatic recognition of this kind of document is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpora for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.