Homophonic Substitution Analyzer in CT2 - Analyzing the infamous Zodiac-408 letter.

Source publication

Decryption of historical manuscripts: the DECRYPT project

Article

Full-text available

Feb 2020

Many historians and linguists are working individually and in an uncoordinated fashion on the identification and decryption of historical ciphers. This is a time-consuming process as they often work without access to automatic methods and processes that can accelerate the decipherment. At the same time, computer scientists and cryptologists are dev...

Context 1

... implemented cryptanalysis algorithms based on hill climbing and simulated annealing, allowing the user to break homophonic substitution ciphers in CT2 (Kopal 2019). Figure 3 shows the Homophonic Substitution Analyzer component in CT2 analyzing the infamous Zodiac-408 letter. The tool allows the user to automatically and semiautomatically analyze homophonically encrypted texts. ...
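
The hill-climbing idea mentioned in this context can be illustrated with a minimal sketch. This is not the CT2 implementation: the bigram scorer, the toy plaintext, the two-homophones-per-letter key, and the names `bigram_model`, `decrypt`, and `hill_climb` are all assumptions made for illustration. The sketch uses plain hill climbing only, without simulated annealing.

```python
import math
import random
from collections import Counter

def bigram_model(sample):
    """Build a log-probability scorer from the bigram counts of `sample`."""
    counts = Counter(zip(sample, sample[1:]))
    total = sum(counts.values())
    floor = math.log(0.01 / total)  # penalty for bigrams never seen in `sample`
    logp = {bg: math.log(c / total) for bg, c in counts.items()}

    def score(text):
        return sum(logp.get(bg, floor) for bg in zip(text, text[1:]))

    return score

def decrypt(cipher, key):
    """Map every ciphertext symbol to its currently assumed plaintext letter."""
    return "".join(key[s] for s in cipher)

def hill_climb(cipher, score, alphabet, steps=2000, seed=0):
    """Reassign one symbol at a time; keep the change only if the score improves."""
    rng = random.Random(seed)
    symbols = sorted(set(cipher))
    # Homophonic key: several symbols may share the same plaintext letter.
    key = {s: rng.choice(alphabet) for s in symbols}
    best = score(decrypt(cipher, key))
    for _ in range(steps):
        s = rng.choice(symbols)
        old = key[s]
        key[s] = rng.choice(alphabet)
        cand = score(decrypt(cipher, key))
        if cand > best:
            best = cand
        else:
            key[s] = old  # revert a move that did not help
    return key, best

# Toy data (hypothetical): each letter gets two homophones, used alternately.
plain = "THECATSATONTHEMATANDTHERATSATONTHEHAT"
letters = sorted(set(plain))
homophones = {ch: (2 * i, 2 * i + 1) for i, ch in enumerate(letters)}
seen = Counter()
cipher = []
for ch in plain:
    cipher.append(homophones[ch][seen[ch] % 2])
    seen[ch] += 1

# The scorer is trained on the plaintext itself, so this only shows the mechanics.
score = bigram_model(plain)
key, best = hill_climb(cipher, score, letters)
```

A practical solver would score candidates with n-gram statistics from a large corpus and would restart or accept occasional worsening moves to escape local maxima, which is where simulated annealing comes in.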

Context 2

... revealed words can be marked and "locked" for further analysis steps. In Figure 3, locked letters are marked green and automatically found words are marked blue. In future work, we plan to adapt the analyzer in such a way that it is able to analyze homophones of different lengths, e.g. ...

An STS analysis of a digital humanities collaboration: trading zones, boundary objects, and interactional expertise in the DECRYPT project

Article

Full-text available

May 2024

A widely shared recognition over the past decade is that the methodology and the basic concepts of science and technology studies (STS) can be used to analyze collaborations in the cross-disciplinary field of digital humanities (DH). The concepts of trading zones (Galison, 2010), boundary objects (Star and Griesemer, 1989), and interactional expertise (Collins and Evans, 2007) are particularly fruitful for describing projects in which researchers from massively different epistemic cultures (Knorr Cetina, 1999) are trying to develop a common language. The literature, however, primarily concentrates on examples where only two parties, historians and IT experts, work together. More exciting perspectives open up for analysis when more than two, more nuanced and different epistemic cultures seek a common language and common research goals. In the DECRYPT project funded by the Swedish Research Council, computational linguists, historians, computer scientists and AI experts, cryptologists, computer vision specialists, historical linguists, archivists, and philologists collaborate with strikingly different methodologies, publication patterns, and approaches. They develop and use common resources (including a database and a large collection of European historical texts) and tools (among others a code-breaking software, a hand-written text recognition tool for transcription), researching partly overlapping topics (handwritten historical ciphers and keys) to reach common goals. In this article, we aim to show how the STS concepts are illuminating when describing the mechanisms of the DECRYPT collaboration and shed some light on the best practices and challenges of a truly cross-disciplinary DH project.

Encrypted Documents and Cipher Keys From the 18th and 19th Century in the Archives of Aristocratic Families in Slovakia

Conference Paper

Full-text available

May 2023

In this article, we present encrypted documents and cipher keys from the 18th and 19th centuries, related to the central-European aristocratic families Amade-Üchtritz, Esterházy, and Pálffy-Daun. In the first part of the article, we present an overview and analysis of the available documents from the archives with examples. We provide a short historical overview of the people related to the analyzed documents to provide a context for the research. In the second part of the article, we focus on the digital processing of these historical manuscripts. We developed new tools based on machine learning that can automate the transcription of encrypted parts of the documents, which use only digits as the ciphertext alphabet. Our digit detection and segmentation are based on YOLOv7. YOLOv7 provided good detection precision and was able to cope with problems like noisy paper background and areas where digits collided with the text from the reverse side of the paper.

Automated Transcription of Historical Encrypted Manuscripts

Article

Full-text available

Dec 2022

This paper deals with historical encrypted manuscripts and introduces an automated method for the detection and transcription of ciphertext symbols for subsequent cryptanalysis. Our database contains documents used in the past by aristocratic families living in the territory of Slovakia. They are encrypted using a nomenclator, which is a specific type of substitution cipher. In our case, the nomenclator uses digits as ciphertext symbols. We have proposed a method for the detection, classification, and transcription of handwritten digits from the original documents. Our method is based on Mask R-CNN, which is a deep convolutional neural network for instance segmentation. Mask R-CNN was trained on a manually collected database of digit annotations. We employ a specific strategy where the input image is first divided into small blocks. The image blocks are then passed to Mask R-CNN to obtain detections. This way we avoid problems related to the detection of a large number of small dense objects in a high-resolution image. Experiments have shown promising detection performance for all digit types with minimal false detections.
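
The block-wise strategy described in this abstract can be sketched as follows. The abstract does not give block sizes or say whether blocks overlap, so the `block` and `overlap` parameters, and the function names `tile_boxes` and `to_page`, are assumptions for illustration. The detector itself is left out.

```python
def tile_boxes(width, height, block, overlap=0):
    """Return (x0, y0, x1, y1) boxes that cover a width x height image."""
    if block <= 0 or not 0 <= overlap < block:
        raise ValueError("need block > 0 and 0 <= overlap < block")
    stride = block - overlap

    def starts(size):
        # One block is enough when the image is smaller than the block.
        if size <= block:
            return [0]
        last = size - block  # final block is aligned to the image edge
        positions = list(range(0, last, stride))
        positions.append(last)
        return positions

    return [
        (x, y, min(x + block, width), min(y + block, height))
        for y in starts(height)
        for x in starts(width)
    ]

def to_page(box, detection):
    """Shift a detection given in block coordinates back to page coordinates."""
    x0, y0, _, _ = box
    dx0, dy0, dx1, dy1 = detection
    return (dx0 + x0, dy0 + y0, dx1 + x0, dy1 + y0)

# Hypothetical page size and block settings.
boxes = tile_boxes(1000, 700, block=512, overlap=64)
```

Each box would be cropped and passed to the detector, and `to_page` would map the resulting detections back onto the full page. With overlapping blocks, duplicate detections in the shared strips would still need to be merged, for example with non-maximum suppression.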

Segmenting Numerical Substitution Ciphers

Preprint

May 2022

Deciphering historical substitution ciphers is a challenging problem. Example problems that have been previously studied include detecting cipher type, detecting plaintext language, and acquiring the substitution key for segmented ciphers. However, attacking unsegmented, space-free ciphers is still a challenging task. Segmentation (i.e. finding substitution units) is the first step towards cracking those ciphers. In this work, we propose the first automatic methods to segment those ciphers using Byte Pair Encoding (BPE) and unigram language models. Our methods achieve an average segmentation error of 2% on 100 randomly generated monoalphabetic ciphers and 27% on 3 real homophonic ciphers. We also propose a method for solving non-deterministic ciphers with existing keys using a lattice and a pretrained language model. Our method leads to the full solution of the IA cipher, a real historical cipher that has not been fully solved until this work.
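
To make the BPE idea concrete, here is a minimal sketch of the generic merge loop applied to an unsegmented digit stream. It is not the method from the preprint: the stopping rule, the tie-breaking, and the function name `bpe_segment` are assumptions, and the unigram language model variant is not shown.

```python
from collections import Counter

def bpe_segment(stream, num_merges):
    """Greedily merge the most frequent adjacent pair, up to num_merges times."""
    tokens = list(stream)  # start with one token per character
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        (a, b), freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:
            break  # nothing repeats, so further merges carry no information
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

units = bpe_segment("121212", 1)  # → ['12', '12', '12']
```

The number of merges controls how long the recovered units become, so in practice it would have to be tuned or chosen by some criterion on the cipher at hand.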

Detection of Classical Cipher Types with Feature-Learning Approaches

Chapter

Full-text available

Dec 2021

To break a ciphertext, as a first step, it is essential to identify the cipher used to produce the ciphertext. Cryptanalysis has acquired deep knowledge on cryptographic weaknesses of classical ciphers, and modern ciphers have been designed to circumvent these weaknesses. The American Cryptogram Association (ACA) standardized so-called classical ciphers, which had historical relevance up to World War II. Identifying these cipher types using machine learning has shown promising results, but the state of the art relies on engineered features based on cryptanalysis. To overcome this dependency on domain knowledge, we explore in this paper the applicability of the two feature-learning algorithms long short-term memory (LSTM) and Transformer, for 55 classical cipher types from ACA. To lower the necessary data and the training time, various transfer-learning scenarios are investigated. Over a dataset of 10 million ciphertexts with a text length of 100 characters, Transformer correctly identified 72.33% of the ciphers, which is a slightly worse result than the best feature-engineering approach. Furthermore, with an ensemble model of feature-engineering and feature-learning neural network types, 82.78% accuracy over the same dataset has been achieved, which is the best known result for this significant problem in the field of cryptanalysis.

Few Shots Is All You Need: A Progressive Few Shot Learning Approach for Low Resource Handwriting Recognition

Preprint

Full-text available

Jul 2021

Handwritten text recognition in low-resource scenarios, such as manuscripts with rare alphabets, is a challenging problem. The main difficulty comes from the scarcity of annotated data and the limited linguistic information (e.g. dictionaries and language models). Thus, we propose a few-shot learning-based handwriting recognition approach that significantly reduces the human annotation effort, requiring only a few images of each alphabet symbol. First, our model detects all symbols of a given alphabet in a textline image, then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. Our model is first pretrained on synthetic line images generated from any alphabet, even one that differs from the target domain. A second training step is then applied to diminish the gap between the source and target data. Since this retraining would require annotation of thousands of handwritten symbols together with their bounding boxes, we propose to avoid such human effort through an unsupervised progressive learning approach that automatically assigns pseudo-labels to the non-annotated data. The evaluation on different manuscript datasets shows that our model can lead to competitive results with a significant reduction in human effort.

Training eyes and training hands in the digital research with manuscripts

Article

Full-text available

Jun 2021

Diego Navarro Bonilla

Can Sequence-to-Sequence Models Crack Substitution Ciphers?

Conference Paper

Jan 2021

Can Sequence-to-Sequence Models Crack Substitution Ciphers?

Preprint

Dec 2020

Decipherment of historical ciphers is a challenging problem. The language of the target plaintext might be unknown, and ciphertext can have a lot of noise. State-of-the-art decipherment methods use beam search and a neural language model to score candidate plaintext hypotheses for a given cipher, assuming plaintext language is known. We propose an end-to-end multilingual model for solving simple substitution ciphers. We test our model on synthetic and real historical ciphers and show that our proposed method can decipher text without explicit language identification and can still be robust to noise.

A Few-shot Learning Approach for Historical Ciphered Manuscript Recognition

Preprint

Full-text available

Sep 2020

Encoded (or ciphered) manuscripts are a special type of historical document that contains encrypted text. The automatic recognition of this kind of document is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpora for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.
