Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals – Collected Notes on Quality Improvement

January 2019


DOI:10.13140/RG.2.2.36328.72960

Authors:

Kimmo Kettunen

Mika Koistinen

Silo AI

This paper presents work that has been carried out in the National Library of Finland to improve optical character recognition (OCR) quality of a Finnish historical newspaper and journal collection 1771–1910. Work and results reported in the paper are based on a 500 000 word ground truth (GT) sample of the Finnish language part of the whole collection. The sample has three different parallel parts: a manually corrected ground truth version, original OCR with ABBYY FineReader v. 7 or v. 8, and an ABBYY FineReader v. 11 re-OCRed version. Based on this sample and its page image originals we have developed a re-OCRing process using the open source software package Tesseract v. 3.04.01. Our methods in the re-OCR include image preprocessing techniques, usage of morphological analyzers and a set of weighting rules for resulting candidate words. Besides results based on the GT sample, we also present results of re-OCR for a 29 year period of one newspaper of our collection, Uusi Suometar. The paper describes the results of our re-OCR process including the latest results. We also state some of the main lessons learned during the development work.


To appear in DHN2019


Kimmo Kettunen [0000-0003-2747-1382] and Mika Koistinen

The National Library of Finland, DH projects Saimaankatu 6, 50 100 Mikkeli, Finland

Firstname.lastname@helsinki.fi


Keywords: OCR; historical newspapers; Tesseract; Finnish

1 Introduction

The National Library of Finland has digitized historical newspapers and journals published in Finland between 1771 and 1929 and provides them online [1-2]. The last decade of the open collection, 1920–1929, was released in early 2018. The collection contains approximately 7.45 million freely available pages, primarily in Finnish and Swedish. The total number of pages on the web is over 14.5 million, and about half of them are in restricted use due to copyright. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data package of the collection’s newspapers and journals from the period 1771 to 1910 was released in early 2017 [2].

Footnote: Tesseract is available at https://github.com/tesseract-ocr

When originally non-digital materials, e.g. old newspapers and books, are digitized, the documents are first scanned, which results in image files. From the image files one needs to separate the texts from possible non-textual data, such as photographs and other pictorial representations. Texts are recognized from the scanned pages with Optical Character Recognition (OCR) software. OCR of modern prints and font types is considered a solved problem that usually yields high quality results, but results of historical document OCR are still far from that [3].

Newspapers of the 19th and early 20th century were mostly printed in the Gothic (Fraktur, blackletter) typeface in Europe. Fraktur is used heavily in our data, although Antiqua is also common, and both typefaces can be used in different parts of the same publication. It is well known that the Fraktur typeface is especially difficult for OCR software to recognize. Other aspects that affect the quality of OCR recognition are the following [3–5]:

● quality of the original source and microfilm

● scanning resolution and file format

● layout of the page

● OCR engine training

● unknown fonts

● etc.

Due to these difficulties, scanned and OCRed document collections have a varying number of errors in their content. A quite typical example is The 19th Century Newspaper Project of the British Library [6]: based on a 1% double keyed sample of the whole collection, Tanner et al. report that 78% of the words in the collection are correct. This quality is not good, but it is quite common in many comparable collections. The number of errors depends heavily on the period and printing form of the original data. Older newspapers and magazines are more difficult for OCR; newspapers from the early 20th century are easier (cf. for example the data of Niklas [7], which consists of a 200 year period of The Times of London from 1785 to 1985). There is no exact measure of the number of errors that makes OCRed material useful or less useful for some purpose, and the purposes and research tasks of the users of digitized material vary hugely [8]. A linguist who is interested in the forms of words needs data that is as error-free as possible; a historian who interprets texts on a broader level may be satisfied with text data that has more errors. Nevertheless, a very high error rate may cause serious discomfort and squeamishness for researchers, as e.g. the article of Jarlbrink and Snickars about the quality of one OCRed Swedish newspaper, Aftonbladet 1830–1862, shows [9].

Ways to improve the quality of OCRed texts are few if total rescanning is out of the question, as it usually is due to labor costs. Improvement can be achieved with three principal methods: manual correction with different aids (e.g. editing software), re-OCRing, or algorithmic post-correction [3]. These methods can also be mixed. We do not believe that manual correction, e.g. with crowdsourcing, is suitable for a large collection of a small language with a small population: there are simply not enough people to perform the crowdsourcing. The capabilities of post-correction are also limited: errors of one to two characters can be corrected, but errors in historical OCR data are not limited to these. It seems that harder errors are still beyond the performance of post-correction algorithms [10-11].

Due to the amount of data, we have chosen re-OCRing with Tesseract v. 3.04.01 as our main method for improving the quality of our collection. In the rest of the paper we describe the results we have achieved so far and discuss lessons learned. Section two describes our initial results, section three the improvements made in the re-OCR process, and section four the latest re-OCR results. Section five concludes the paper with some lessons that we have learned during the process.

2 Results – Part I

Our re-OCR process has been described thoroughly in [12–13]. As its main parts are unchanged, we describe it only briefly here. The re-OCRing process consists of four parts: 1) image preprocessing of page images using five different techniques, which yields better quality images for the OCR, 2) Tesseract OCR 3.04.01, 3) choosing the best candidate from Tesseract’s output and the old ABBYY FineReader data, and 4) transformation of Tesseract’s output to ALTO format. We have developed a new Finnish Fraktur model for Tesseract using an existing German Fraktur model as a starting point.

We have evaluated the results of the re-OCR along the development process with different measures using our ground truth data of about 500 000 words [14]. This parallel data consists of a proofread version of the data, the current ABBYY FineReader v. 7/8 OCR, the Tesseract 3.04.01 OCR and the ABBYY FineReader v. 11 OCR.

2.1 Precision and Recall

There is no real standard measure for OCR improvement, and for this reason we have used several measures to evaluate the improvement of the process. Precision and recall are standard measures used in information retrieval, and they can also be applied to the analysis of re-OCR results [10]. When we applied recall, precision and F-score to the data, we got a recall of 0.72, precision of 0.73 and F-score of 0.73. Combined optimal OCR results of Tesseract and ABBYY FineReader v. 11 would give a recall of 0.81, precision of 0.95 and F-score of 0.88. The latter figures show that using several OCR engines would benefit re-OCRing, as has been stated in the research literature [15]. Unfortunately we do not have access to several new OCR engines in our final re-OCR.

Precision, recall and their combination, F-score, are useful figures, but it is also worth taking a closer look at the numbers behind the scores. As we analyzed the output of the P/R analysis further, we noticed the following. The number of erroneous words in the data was 126 758 and the number of errorless words 345 145. Re-OCR corrected 90 877 of the errors (true positives, 71.7% of errors) and left 35 881 uncorrected (false negatives, 28.3% of errors). The re-OCR process also introduced 32 953 new errors into the data (false positives). In general it seems that the recall of the re-OCR with regard to erroneous words is satisfactory, but precision is low, as the process produces quite a lot of new errors. This harms the overall result. On the other hand, many of the errors were only errors in punctuation: if these were discarded, the results were slightly better. Although every character counts for algorithms that perform evaluation, not every character difference is of equal importance for human understanding of the output. Assuming that the form Porvoo is the correct result, the three versions Porwoo/Porwo,/Worvoo, each at most two characters away from it, are not equally intelligible: the last one would probably be the hardest to understand even in context.
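As an illustration, the scores quoted above can be recomputed from the raw counts. The following is a minimal sketch using the standard definitions of precision, recall and F-score; the function name is an assumption made for this example and is not part of the re-OCR software.

```python
def precision_recall_f(tp: int, fn: int, fp: int) -> tuple:
    """Return (precision, recall, F-score) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts reported above for the initial re-OCR run on the GT sample.
TRUE_POSITIVES = 90_877    # erroneous words corrected by re-OCR
FALSE_NEGATIVES = 35_881   # erroneous words left uncorrected
FALSE_POSITIVES = 32_953   # new errors introduced by re-OCR

p, r, f = precision_recall_f(TRUE_POSITIVES, FALSE_NEGATIVES, FALSE_POSITIVES)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Rounded to two decimals, the result matches the precision of 0.73, recall of 0.72 and F-score of 0.73 given in the text.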

2.2 Character Error and Word Error Rate

Two other commonly used evaluation measures for OCR output are character error rate, CER, and word error rate, WER [16]. CER is defined as

$$\mathrm{CER} = \frac{i + s + d}{n}$$

and it employs the total number n of characters and the minimal number of character insertions i, substitutions s and deletions d required to transform the reference text into the OCR output.

Word error rate WER is defined as

$$\mathrm{WER} = \frac{i_w + s_w + d_w}{n_w}$$

where n_w is the total number of words in the reference text, i_w is the minimal number of insertions, s_w the number of substitutions and d_w the number of deletions on word level to obtain the reference text. Smaller WER and CER values mean better quality. Our initial CER and WER results for the OCR process are shown in Table 1. These results have been analyzed with the OCR evaluation tool described in Carrasco [16]. As can be seen from the figures, the CER and WER values of the re-OCR are clearly better than those of the current OCR. The difference is especially clear in word error rate, which drops to about a half.

Table 1. Character and word error rates for the DIGI test set

                          Re-OCR   Current OCR
CER                         5.84          7.81
WER                        13.65         27.3
WER (order independent)    11.88         25.25

Footnote: The OCR evaluation tool is available at http://impact.dlsi.ua.es/ocrevaluation/. Similar software is PRImA Research’s Text Evaluation tool, available from http://www.primaresearch.org/tools/PerformanceEvaluation.
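The CER and WER definitions above can be sketched in a few lines. This is a minimal illustration that counts insertions, substitutions and deletions with a unit-cost Levenshtein distance; it is not the evaluation tool of Carrasco [16], and the function names and whitespace tokenization are assumptions made for the example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimal number of insertions, deletions and substitutions (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, ocr_output: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, ocr_output) / len(reference)


def wer(reference: str, ocr_output: str) -> float:
    """Word error rate on whitespace-separated tokens (assumed tokenization)."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


print(f"CER = {cer('kokouksessa', 'fofoufsessa'):.3f}")
print(f"WER = {wer('kokouksessa oli', 'fofoufsessa oli'):.3f}")
```

In the toy input three of eleven characters are wrong and one of two words is wrong. The values in Table 1 are the same ratios expressed as percentages.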

Evaluation of OCR results can be done experimentally either with or without ground truth. After initial development and evaluation of the re-OCR process with the GT data, we started testing the re-OCR process with realistic newspaper data, i.e. without GT, to avoid overfitting the process to the GT data. We chose Uusi Suometar for testing, a newspaper which appeared in 1869–1918 and has 86 068 pages. Table 2 shows the results of a 10 years’ re-OCR of Uusi Suometar with our first re-OCR process. We show here results of morphological recognition with (His)Omorfi, which has been enhanced to better process historical Finnish. These results give merely an estimate of the improvement in word quality [1].

Table 2. Recognition rates of current and new OCR words of Uusi Suometar with morphological analyzer HisOmorfi (total of 7 937 pages)

Year   Words        Current OCR   Tesseract 3.04.01   Gain in % units
1869      658 685   69.6%         86.7%               17.1
1870      655 772   66.9%         84.9%               18.0
1871      909 555   73%           87%                 14.0
1872      930 493   76%           88.7%               12.7
1873      889 725   75.4%         87.3%               11.9
1874      920 307   72.9%         85.9%               13.0
1875    1 070 806   71.5%         86%                 14.5
1876    1 223 455   72.8%         86.7%               13.9
1877    1 815 635   73.9%         86%                 12.1
1878    2 135 411   72%           85.4%               13.4
1879    2 238 412   74.7%         87%                 12.3
ALL    13 448 256   73%           86.5%               13.5

Re-OCR improves the recognition rates considerably and consistently. The minimum improvement is 11.9% units and the maximum 18% units. On average the improvement is 13.5% units.
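The recognition rates in Table 2 are the share of word tokens that a morphological analyzer accepts, and the gain is the difference between two such rates. The sketch below shows the idea under stated assumptions: HisOmorfi is not called here, and `toy_analyzer` with its tiny word list is a hypothetical stand-in for it.

```python
from typing import Callable, Iterable


def recognition_rate(tokens: Iterable[str],
                     is_recognized: Callable[[str], bool]) -> float:
    """Percentage of tokens accepted by the analyzer."""
    tokens = list(tokens)
    hits = sum(1 for token in tokens if is_recognized(token))
    return 100.0 * hits / len(tokens)


# Hypothetical stand-in for a morphological analyzer: a tiny word list.
TOY_LEXICON = {"kokouksessa", "maansihteerille", "kitsaudestaan"}


def toy_analyzer(token: str) -> bool:
    return token.lower() in TOY_LEXICON


# Toy token lists standing in for the old OCR and the re-OCR of the same text.
old_ocr = ["fofoufsessct", "maansihteerille", "fitsattbestaan", "kokouksessa"]
new_ocr = ["kokouksessa", "maansihteerille", "kitsaudestaan", "kokouksesfa"]

old_rate = recognition_rate(old_ocr, toy_analyzer)
new_rate = recognition_rate(new_ocr, toy_analyzer)
print(f"old={old_rate:.1f}% new={new_rate:.1f}% gain={new_rate - old_rate:.1f} % units")
```

A rate of this kind only estimates word quality: a misrecognized word can still be a valid word form, and a correct but rare historical form may be rejected by the analyzer.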

As can be seen, all our initial results show clear improvement in the quality of the OCR. The improvement could be characterized as noticeable, but perhaps not good enough.

2.3 Examination of the data: false and true positives

On closer inspection, part of the false positives of the re-OCR are due to recurring trouble with quotation marks or with the division of a word across two lines when it is hyphenated at the line end. The re-OCR misses a quotation mark or two in the result word, or it produces the HTML code &quote; instead of the quotation mark itself. Many words are also wrongly divided at the line break. The same applies to false negatives, too. The number of all wrong word divisions in the data of false and true positives together is about 10 000, which makes this error type one of the most common. Missing or extra punctuation also causes errors. When true positives are examined, one can see that about 54% of the errors corrected are one character corrections and about 89% are 1–3 character corrections. But re-OCR also corrects truly hard errors. Even errors with Levenshtein distance (LD) over 10 are corrected; a few examples of word pairs with an edit distance of 11 are shown in Table 3.

Table 3. Corrections of Levenshtein distance of 11.

Original OCR                  Tesseract 3.04.01
eiifuroauffellt»              esikuwauksellisesti
KarjlltijoloSluSyhbiStytsen   Karjanjalostusyhdistyksen
ttfcnfäMtämifeSfä,            itsensäkieltämisessä,
liiannfiljtccvillc            maansihteerille

Another example of corrected hard errors are the 2 376 words that have a Levenshtein edit distance of five. When the error count is this high, words become unintelligible. Some examples of corrections with five errors are shown in Table 4.

Table 4. Corrections of Levenshtein distance of 5.

Original OCR       Tesseract 3.04.01
fofoufsessct,      kokouksessa
silmciyfsert       silmäyksen
ncihbessciän       nähdessään
roäliHä            wälillä.
yfsincicin.        yksinään
tylyybestcicin     tylyydestään
fitsattbestaan,    kitsaudestaan.
Iywäzlyllln        Jywäskylän
pairoana           päiwänä

The bigger the error count is, the harder the error is to correct for post-correction software, and here lies the strength of re-OCR at its best. Reynaert [10], e.g., states that his post-correction system for Dutch, TICCL, corrects best errors of LD 1-2. It can be run with LD 3, “but this has a high processing cost and most probably results in lower precision.” Error correction for LD 4 and higher values he considers too ambitious for the time being. This is also one of the conclusions in Choudhury et al. [11].

The number of corrected words with edit distances of 1–10 in the true positives of our re-OCR process can be seen in Table 5.

Footnote: Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965. https://en.wikipedia.org/wiki/Levenshtein_distance

Footnote: “It is impossible to correct very noisy texts, where the nature of the noise is random and words are distorted by a large edit distance (say 3 or more).”

Table 5. Number of corrected words with edit distances of 1–10: 99.2% of all the true positives

Edit distance   Number of corrections
LD 1            47 783
LD 2            22 713
LD 3             9 182
LD 4             4 375
LD 5             2 376
LD 6             1 519
LD 7               920
LD 8               629
LD 9               423
LD 10              315
SUM             90 235 (total of 90 877 true positives)

Overall, the sum of character errors in the data decreased from the old OCR’s 293 364 to 220 254 in the Tesseract OCR, which is about a 25% decrease. Tesseract produces significantly more errorless words than the old OCR (403 069 vs. 345 145), but it also produces more character errors per erroneous word. The old OCR has about 2.32 errors per erroneous word, the Tesseract OCR 3.2. This can be seen as a mixed blessing: erroneous words are encountered more seldom in Tesseract’s output, but they may be harder to read and understand when they occur.
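The arithmetic behind these figures can be retraced from the counts given in Sections 2.1 and 2.3. The sketch below assumes that both OCR versions cover the same 471 903 word tokens (126 758 erroneous plus 345 145 errorless words in the old OCR); that total is derived here for illustration and is not stated in the text.

```python
# Word total derived from the old OCR counts in Section 2.1 (assumption:
# the Tesseract output covers the same word tokens).
TOTAL_WORDS = 126_758 + 345_145

OLD_OCR = {"char_errors": 293_364, "errorless_words": 345_145}
TESSERACT = {"char_errors": 220_254, "errorless_words": 403_069}


def errors_per_erroneous_word(stats: dict, total_words: int) -> float:
    """Character errors divided by the number of words containing errors."""
    erroneous_words = total_words - stats["errorless_words"]
    return stats["char_errors"] / erroneous_words


old_rate = errors_per_erroneous_word(OLD_OCR, TOTAL_WORDS)
new_rate = errors_per_erroneous_word(TESSERACT, TOTAL_WORDS)
decrease = 100.0 * (OLD_OCR["char_errors"] - TESSERACT["char_errors"]) / OLD_OCR["char_errors"]

print(f"old OCR: {old_rate:.2f} errors per erroneous word")
print(f"Tesseract: {new_rate:.2f} errors per erroneous word")
print(f"decrease in character errors: {decrease:.1f}%")
```

Under this assumption the results land close to the figures quoted above: roughly 2.3 and 3.2 errors per erroneous word, and a decrease of about 25% in character errors.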

3 Improvements for the re-OCR Process

The results we achieved with our initial re-OCR process were at least promising. They showed clear improvement of quality in the GT collection and also outside it, with the realistic newspaper data shown in Table 2. Slightly better OCR results were achieved by Drobac et al. [17] with the Ocropy machine learning OCR system, using character accuracy rate (CAR) as the measure. The post-correction results of Silfverberg et al. [18], however, were worse than our re-OCR results.

The main drawback of our re-OCR system is that it is relatively slow. Image processing and combining of images takes time if it is performed on every page image, as is currently done. The execution speed of the word level system was initially about 6 750 word tokens per hour when using a CPU with 8 cores in a standard Linux environment. With an increase to 28 cores the speed improved to 29 628 word tokens per hour. The speed of the process was still not very satisfying.

Footnote: Silfverberg et al. have evaluated algorithmic post-correction results of the hfst-ospell software with part of the historical data, 40 000 word pairs. They have used correction rate as their measure, and their best result is 35.09 ± 2.08 (confidence value). The correction rate of our initial re-OCR process data is 0.47, clearly better than the post-correction results of Silfverberg et al. Our result is also achieved with an almost twelvefold number of word pairs.

We have been able to improve the processing speed of re-OCR considerably during the latest modifications. We have especially improved the string replacements performed during the process, as they took almost as much time as the image processing. String replacements now take only a fraction of the time they took earlier, but image processing cannot be sped up easily. The new processing takes about half of the time it used to take with the GT data. We are now able to process about 201 800 word tokens an hour in a 28 core system.

We also improved the process for word candidate selection after re-OCR. We have been using two morphological analyzers (Omorfi and Voikko), character trigrams and other character level data to weight the suggestions given by the OCR process. We checked especially the trigram list and removed the least frequent trigrams from it.
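To make the idea of weighting candidates concrete, the following is a minimal sketch of one possible scoring scheme: a bonus for words accepted by a lexicon check plus the share of a word's character trigrams found in a trigram list. It is not the actual weighting used in the re-OCR process; the weight value, the `#` padding, the toy lexicon and all function names are assumptions made for this example, and neither Omorfi nor Voikko is called.

```python
from typing import Callable, Iterable, Set


def char_trigrams(word: str) -> list:
    """Character trigrams of a word, padded with '#' at both ends."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def score(word: str, is_known_word: Callable[[str], bool],
          known_trigrams: Set[str], lexicon_weight: float = 2.0) -> float:
    """Lexicon bonus plus the share of the word's trigrams that are known."""
    grams = char_trigrams(word)
    trigram_share = sum(g in known_trigrams for g in grams) / len(grams) if grams else 0.0
    lexicon_bonus = lexicon_weight if is_known_word(word) else 0.0
    return lexicon_bonus + trigram_share


def choose_candidate(candidates: Iterable[str], is_known_word: Callable[[str], bool],
                     known_trigrams: Set[str]) -> str:
    """Pick the highest-scoring candidate (the first one wins a tie)."""
    return max(candidates, key=lambda w: score(w, is_known_word, known_trigrams))


# Hypothetical stand-ins for a morphological analyzer and a trigram list.
TOY_LEXICON = {"kokouksessa", "maansihteerille"}
TOY_TRIGRAMS = {g for w in TOY_LEXICON for g in char_trigrams(w)}


def toy_is_known(word: str) -> bool:
    return word.lower() in TOY_LEXICON


# Candidates for one word position: old OCR output and new OCR output.
print(choose_candidate(["fofoufsessct", "kokouksessa"], toy_is_known, TOY_TRIGRAMS))
```

When no candidate is accepted by the lexicon check, the trigram share alone decides, so a near miss such as "kokouksesfa" still outranks "fofoufsessct" in this toy setup.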

4 Results – Part II

After the improvements made to the re-OCR process we have also been able to achieve better results. The latest results are shown in Tables 6 and 7. Table 6 shows precision, recall and correction rate results, and Table 7 shows the results of CER, WER and CAR analyses using the ground truth data.

Table 6. Precision and recall of the re-OCR after improvements: GT data

Words without errors       374 299
Words with errors          131 008
Errorless not corrected    366 043
Sum (lines 1 and 2)        505 307
True positives              99 071
False negatives             31 937
False positives              8 256
Recall                        0.76
Precision                     0.92
F-score                       0.83
Correction rate               0.69
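The text does not spell out how the correction rate is computed. One definition that reproduces the 0.69 in Table 6 is the net number of corrected words (true positives minus false positives) divided by the number of erroneous words; the sketch below uses that formula as an assumption, not as a confirmed definition.

```python
def correction_rate(tp: int, fn: int, fp: int) -> float:
    """Assumed definition: net corrections divided by all erroneous words."""
    return (tp - fp) / (tp + fn)


# Counts from Table 6.
TRUE_POSITIVES = 99_071
FALSE_NEGATIVES = 31_937
FALSE_POSITIVES = 8_256

rate = correction_rate(TRUE_POSITIVES, FALSE_NEGATIVES, FALSE_POSITIVES)
print(f"correction rate = {rate:.2f}")
```

The other rows of Table 6 are consistent with these counts: true positives plus false negatives give the 131 008 words with errors, and 374 299 words without errors minus 8 256 false positives give the 366 043 errorless words left untouched.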

Footnote: Omorfi: https://github.com/jiemakel/omorfi

Footnote: Voikko: https://voikko.puimula.org/

Table 7. CER, WER and CAR of the re-OCR after improvements: GT data

                          Re-OCR   Current OCR
CER                         2.05          6.47
WER                         6.56         25.30
WER (order independent)     5.51         23.41
CAR                        97.64         92.62

The results in Tables 6 and 7 show that the re-OCR process has improved clearly from the initial performance shown in Section 2. Precision of the process has improved considerably, and although recall is still slightly low, the F-score is now 0.83 (earlier 0.73). CER and WER have also improved clearly. Our CAR is now also slightly better than Drobac’s best value without post-correction (ours 97.6 vs. Drobac’s 97.3 [17]).

Recognition results of the latest re-OCR of Uusi Suometar are shown in Figure 1. The data consists of the years 1869–1898 of the newspaper, with about 115 930 415 words and 33 000 pages.

Fig. 1. Latest recognition rates of Uusi Suometar 1869-1898 with HisOmorfi

Footnote (Current OCR column of Table 7): These figures differ slightly from the figures of current OCR in Table 1 due to the fact that the improved re-OCR process now finds more matching word pairs in the image data.

Re-OCR improves the quality of the newspaper clearly and consistently, and the overall results are slightly better than in Table 2. The average improvement for the whole period of 30 years is 15.3% units. The largest improvement is 20.5% units and the smallest 12% units.

5 Conclusion

We have described in this paper the results of a re-OCR process for a historical Finnish newspaper and journal collection. The developed re-OCR process consists of a combination of five different image preprocessing techniques and a new Finnish Fraktur model for Tesseract 3.04.01 OCR, enhanced with morphological recognition and character level rules to weight the resulting candidate words. Out of the results we create new OCRed data in METS and ALTO XML format that can be used in our docWorks document presentation system.

We have shown that the re-OCRing process yields clearly better results than the commercial OCR engine ABBYY FineReader v. 7/8, which is our current OCR engine. We have also shown that a 29 year time span of the newspaper Uusi Suometar (33 000 pages and ca. 115.9 million words) gets significantly and consistently improved word recognition rates for Tesseract output in comparison to the current OCR. We have also shown that our results are either equal to or slightly better than the results of the machine learning OCR system Ocropy in Drobac et al. [17]. Our results clearly outperform the post-correction results of Silfverberg et al. [18].

Let us now turn to lessons learned during the re-OCR process so far. Our development cycle for a new re-OCR process has been relatively long and has taken more time than we were able to estimate in advance. We started the process by first creating the GT collection for Finnish [14]. The end result was a ca. 525 000 word collection of OCR data of varying quality with ground truth. The collection could be larger, but given our limited means it seems sufficient. In comparison to GT data used in the OCR or post-correction literature it also fares well, being a mid-sized collection. The GT collection has been the cornerstone of our quality improvement process: effects of the changes in the re-OCR process have been measured with it. The second time-consuming part of the process was the creation of a new Fraktur font model for Finnish. Even if the font was based on an existing German font model, it needed lots of manual effort in picking letter images from different newspapers and finding suitable Fraktur fonts for creating synthesized texts. This was, however, crucial for the process and could not be bypassed.

A third lesson in our process was the choice of the actual OCR engine. Most of the OCR engines that are used in research papers are different versions of the latest machine learning algorithms. They may show nice results on narrowly chosen evaluation data, but the software is usually not of production quality and could not be used in an industrial OCR process that handles 1–2 million page images a year. Thus our slightly conservative choice of open source Tesseract, which has been around for more than 20 years, is justifiable.

Another, slightly unforeseen problem has been the modifications needed to the existing ALTO XML output of the whole process. As ALTO XML is a standard approved by the ALTO board, changes to it are not made easily. An easy way to circumvent this is to use two different ALTOs in the database of docWorks: one conforming to the existing standard and another one that includes the necessary changes after re-OCR. We have chosen this route by including some of the word candidates of the re-OCR in the database as variants.

Footnote: ALTO XML: https://www.loc.gov/standards/alto/

We shall continue the re-OCR process by first re-OCRing the whole history of Uusi Suometar. Its 86 000 pages should give us enough experience so that after that we can move on to re-OCRing the whole Finnish collection. As there are hundreds of publications to be re-OCRed, usage data of the collections is informative in planning the re-OCR: the most used newspapers and journals need to be re-OCRed first.

We have also created a Swedish language GT collection to be able to start re-OCRing the Swedish language part of our collection. The size of the Swedish GT collection will be about 250 000 words from Swedish language newspapers and journals published in Finland in 1771–1775 and 1798–1919. We should be able to quickly start re-OCR trials with the Swedish data using the re-OCR process developed so far. There should be no need for new font model generation for Swedish Fraktur, as such a font is already available.

OCR errors in the digitized newspapers and journals may have several harmful effects for users of the data. One of the most important effects of poor OCR quality, besides worse readability and comprehensibility, is worse online searchability of the documents in the collections [19–20]. Although information retrieval is quite robust even with corrupted data, IR works best with longer documents and long queries, especially when the data is of bad quality. Empirical results of Järvelin et al. [21] with a Finnish historical newspaper search collection, for example, show that even impractically heavy usage of fuzzy matching in order to circumvent effects of OCR errors will help only to a limited degree in searching a low quality OCRed newspaper collection when short queries and their query expansions are used.

Weaker searchability of the OCRed collections is one dimension of poor OCR quality. Other effects of poor OCR quality may show in the more detailed processing of the documents, such as sentence boundary detection, tokenization and part-of-speech tagging, which are important in higher-level natural language processing tasks [22]. Part of the problems may be local, but part will accumulate through the whole natural language processing pipeline, causing errors. Thus the quality of the OCRed texts is the cornerstone for any kind of further usage of the material, and improvements in OCR quality are welcome. Last but not least, user dissatisfaction with the quality of the OCR, as testified e.g. in Jarlbrink and Snickars [9], is of great importance. Digitized historical newspaper and journal collections are meant for users, both researchers and lay persons. If they are not satisfied with the quality of the content, improvements need to be made.

Acknowledgment

This work is funded by the European Regional Development Fund and the program

Leverage from the EU 2014-2020.

References

1. Kettunen, K., Pääkkönen, T.: Measuring Lexical Quality of a Historical Finnish Newspa-

per Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools

and Means,” Proc. of the Tenth International Conference on Language Resources and

Evaluation (LREC 2016).

2. Pääkkönen, T., Kervinen, J., Nivala, A., Kettunen, K., Mäkelä, E.: Exporting Finnish Dig-

itized Historical Newspaper Contents for Offline Use. D-Lib Magazine, July/August

(2016).

3. Piotrowski, M.: Natural Language Processing for Historical Texts. Synthesis Lectures on

Human Language Technologies, Morgan & Claypool Publishers (2012).

4. Holley, R.: How good can it get? Analysing and Improving OCR Accuracy in Large Scale

Historic Newspaper Digitisation Programs. D-Lib Magazine, 15(3/4) (2009).

5. Doermann, D., Tombre, K. (Eds.): Handbook of Document Image Processing and Recog-

nition. Springer (2014).

6. Tanner, S., Muñoz, T., Ros, P.H.: Measuring Mass Text Digitization Quality and Useful-

ness. Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th

Century Online Newspaper Archive. D-Lib Magazine, (15/8) (2009).

7. Niklas, K.: Unsupervised Post-Correction of OCR Errors. Diploma Thesis, Leibniz Uni-

versität, Hannover. www.l3s.de/~tahmasebi/Diplomarbeit_Niklas.pdf (2010).

8. Traub, M. C., Ossenbruggen, J. van, Hardman, L.: Impact Analysis of OCR Quality on Re-

search Tasks in Digital Archives. In: Kapidakis, S., Mazurek, C., Werla, M. (eds.), Re-

search and Advanced Technology for Libraries. Lecture Notes in Computer Science, vol.

9316, pp. 252-263 (2015).

9. Jarlbrink, J., Snickars, P.: Cultural heritage as digital noise: nineteenth century newspapers

in the digital archive. Journal of Documentation, https://doi.org/10.1108/JD-09-2016-0106

(2017).

10. Reynaert, M.: OCR Post-Correction Evaluation of Early Dutch Books Online – Revisited.

In Proceedings of LREC, pp. 967–974 (2016)

11. Choudhury, M. Thomas, M., Mukherjee, A., Basu, A., Ganguly, N.: How difficult is it to

develop a perfect spell-checker? A cross-linguistic analysis through complex network ap-

proach. In Proceedings of the second workshop on TextGraphs: Graph-based algorithms

for natural language processing, pp. 81–88, (2007).

12. Koistinen, M., Kettunen, K., Kervinen, J.: How to Improve Optical Character Recognition

of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine. Proc. of

LTC 2017, Nov. 2017, pp. 279–283 (2017).

13. Koistinen, M., Kettunen, K., Pääkkönen, T.: Improving Optical Character Recognition of

Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Im-

age Preprocessing. Proc. of the 21st Nordic Conference on Computational Linguistics,

NoDaLiDa, May 2017, pp. 277–283 (2017).

14. Kettunen, K., Kervinen, J., Koistinen, M.: Creating and using ground truth OCR sample

data for Finnish historical newspapers and journals. In DHN2018, Proceedings of the Digi-

tal Humanities in the Nordic Countries 3rd Conference, 162-169. http://ceur-ws.org/Vol-

2084/ (2018).

15. Volk, M., Furrer, L., Sennrich, R.: Strategies for reducing and correcting OCR errors. In C.

Sporleder, A. van den Bosch, and K. Zervanou, Eds. Language Technology for Cultural

Heritage, 2011, 3–22 (2011).

16. Carrasco, R.C.: An open-source OCR evaluation tool. In: DATeCH '14, Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 179–184 (2014).

17. Drobac, S., Kauppinen, P., Lindén, K.: OCR and post-correction of historical Finnish texts. In: Tiedemann, J. (ed.) Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22–24 May 2017, Gothenburg, Sweden, 70–76 (2017).

18. Silfverberg, M., Kauppinen, P., Lindén, K.: Data-Driven Spelling Correction Using Weighted Finite-State Methods. In: Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 51–59, https://aclweb.org/anthology/W/W16/W16-2406.pdf (2016).

19. Taghva, K., Borsack, J., Condit, A.: Evaluation of Model-Based Retrieval Effectiveness with OCR Text. ACM Transactions on Information Systems, 14(1), 64–93 (1996).

20. Kantor, P. B., Voorhees, E. M.: The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Texts. Information Retrieval, 2, 165–176 (2000).

21. Järvelin, A., Keskustalo, H., Sormunen, E., Saastamoinen, M., Kettunen, K.: Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. Journal of the Association for Information Science and Technology 67(12), 2928–2946 (2016).

22. Lopresti, D.: Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition, 12: 141–151 (2009).
