Conference Paper

Extracting table data from images using optical character recognition text

May 2018

DOI:10.1109/SIU.2018.8404746

Conference: 2018 26th Signal Processing and Communications Applications Conference (SIU)

Authors:

Mehmet Yasin Akpınar

Bogazici University

Erdem Emekligil

Bogazici University

Secil Arslan

Unicredit Turkey - Yapi Kredi Bank

Extracting Table Data from Images Using Optical Character Recognition Text

(Original Turkish title: Optik Karakter Tanıma Metinlerini Kullanarak Görüntülerden Tablo Verilerini Ayıklama)

Mehmet Yasin AKPINAR, Erdem EMEKLİGİL, Seçil ARSLAN

R&D and Special Projects

Yapı Kredi Teknoloji A.Ş.

Istanbul, TURKEY

{mehmetyasin.akpinar, erdem.emekligil, secil.arslan}@ykteknoloji.com.tr

Abstract: Image-based documents can now be converted into digital, processable form quite successfully with optical character recognition (OCR) tools. However, preserving the layout of the original document remains problematic, and reading tabular data is one of the most important of these problems. This paper proposes a method in which the table contents of documents scanned from hard copy are read with an OCR tool and, with the help of the character positions, restored to tabular form for storage. The performance of the method is measured by the number of detected rows and columns and is presented in comparison with commercially available products.

Keywords: Table Recognition, Optical Character Recognition, Text Processing.

I. INTRODUCTION

Extracting information from printed documents is something everyone needs from time to time, and optical character recognition (OCR) tools are very useful for this purpose. In some cases, however, these tools fall short. Transferring a table in a printed document to digital form while preserving its layout is one of the best examples. Although commercial and open-source products attempt to solve this problem, the tests carried out showed that they do not reach the desired performance.

This paper proposes a method to solve this problem. In implementing the method, the tables in the documents are assumed to cover the entire page content. In other words, the proposed method tries to place all of the page text into a single table.

The paper is organized as follows. Section II summarizes the state of the art with references. Section III then presents the proposed method in detail. Section IV presents the results of the study in comparison with off-the-shelf commercial products; this comparison is based on the numbers of detected rows and columns. Section V, the conclusion, summarizes the paper and discusses future work.

II. RELATED WORK

The studies found in the literature review fall into two main groups. The first comprises optical character recognition technologies, which extract text from images; the second comprises studies on constructing tables from documents in text format.

Work in the first group has been published intensively for many years. A summary of this work can be found in [1]. Recently, however, work in this group has shifted toward areas such as text recognition in natural scenes [2] [3], handwriting recognition [4] [5] [6], and real-time text recognition [7] [8].

Work in the second group dates from the same period, but studies on this topic are less frequent. A survey published in 2003 summarizes the state of the art up to that time [9]. Examples of more recent work are [10] and [11]. Some of these studies also target more specific areas [12].

Studies that combine these two approaches to extract table data from images are not encountered very often. [11] also includes an approach in this direction. Recent examples of work in this area are given in [13] and [14].

978-1-5386-1501-0/18/$31.00 © 2018 IEEE

The main reason similar studies are carried out again and again is that no method suitable for everyone has been found. The work in this paper likewise became necessary because the open-source and commercial products that were tried did not provide sufficient performance.

III. METHOD

As mentioned in the Introduction, this work applies a series of sorting and grouping steps in succession to extract the text from the image in a suitable format. As input, the system uses the words read from the image by an OCR application (ABBYY FineReader), together with the position (left, right, top, and bottom pixel values) and size (width and height) of each word. These position and size values are computed by merging the positions that the OCR tool reports for the characters of each word. In all steps, pixel values are measured from the top-left corner of the page, which is taken as the origin. From this point on, the problem becomes a stream processing problem.

The steps applied can be roughly listed as follows:

1) Detecting the columns
2) Sorting the words from top to bottom and separating them into rows
3) Sorting the words in each row from left to right
4) Grouping the words by column and building the table structure
5) Merging cells that span more than one row (optional)
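The input representation described above, in which character positions reported by the OCR tool are merged into one box per word, can be sketched as follows. This is a minimal illustration only: the `Box` class and the `merge_characters` function are hypothetical names, and the actual ABBYY FineReader output format is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A piece of text with its pixel box; the origin is the top-left page corner."""
    text: str
    left: int
    right: int
    top: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def merge_characters(chars):
    """Merge the character boxes of one word into a single word box."""
    if not chars:
        raise ValueError("a word needs at least one character")
    return Box(
        text="".join(c.text for c in chars),
        left=min(c.left for c in chars),
        right=max(c.right for c in chars),
        top=min(c.top for c in chars),
        bottom=max(c.bottom for c in chars),
    )
```

Width and height are derived from the four edges here rather than stored, so the position and size values cannot disagree.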

A. Column Detection

In implementing the proposed method, column detection was addressed first. The main reason is that there are fewer columns than rows, so column detection is less affected by page skew. To detect the columns, the pixel value of each word's midpoint along the horizontal axis is computed and a histogram of these values is built. The number of bins in this histogram was set empirically to 10 times the number of columns on the page.
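The midpoint histogram can be sketched as below. Two points are assumptions of this sketch and are not stated in the paper: the histogram is taken to span the full page width, and the column count is supplied by the caller. The function name `midpoint_histogram` is hypothetical.

```python
def midpoint_histogram(words, page_width, n_columns):
    """Count horizontal word midpoints per bin.

    words: iterable of (left, right) pixel pairs, one per word.
    The bin count follows the empirical rule from the text: columns * 10.
    """
    n_bins = n_columns * 10
    bin_width = page_width / n_bins  # assumption: bins cover the whole page width
    hist = [0] * n_bins
    for left, right in words:
        mid = (left + right) / 2
        index = min(int(mid / bin_width), n_bins - 1)  # clamp the right page edge
        hist[index] += 1
    return hist
```

For a 200-pixel-wide page with 2 columns this gives 20 bins of 10 pixels each, so a word spanning pixels 10 to 30 is counted in bin 2.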

To prevent false column detections caused by page noise or formatting defects, bins below a certain threshold must be ignored entirely. For this purpose, bins whose value is less than 1/4 of the highest value in the histogram are treated as 0.
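A minimal sketch of this thresholding step is given below; the function name `suppress_weak_bins` and the `ratio` parameter are illustrative choices, with `ratio=0.25` mirroring the 1/4 rule in the text.

```python
def suppress_weak_bins(hist, ratio=0.25):
    """Zero every bin whose value is below `ratio` times the histogram maximum."""
    if not hist:
        return []
    cutoff = max(hist) * ratio
    # Bins exactly at the cutoff are kept: only values below it are zeroed.
    return [value if value >= cutoff else 0 for value in hist]
```

With a maximum of 8 the cutoff is 2, so a bin holding 1 word is dropped while a bin holding 2 words survives.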

Once the histogram has been obtained in this way, a sliding window is used to find the bins where words are concentrated. The window size is chosen according to the average width of the columns on the page, measured in bins. For the pages in the test set, with the number of bins set by the formula above, this width is approximately 5 bins, so the window size was set to 5. The window is slid over the histogram data, and at each position the bin with the highest value inside the window is recorded, producing a new histogram. The purpose of this step is to make the bins with the locally highest values easier to detect. In the new histogram, bins whose value equals the window size indicate the local maxima, and these bins are taken as the midpoints of the columns.
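One way to read the sliding-window step is as a voting scheme: every window position votes for its highest bin, and a bin that wins all of the windows containing it is a local maximum. The sketch below follows that reading. The zero padding at both ends, the skipping of empty windows, and the tie handling (the leftmost bin wins) are assumptions of this sketch, and `find_column_bins` is a hypothetical name.

```python
def find_column_bins(hist, window=5):
    """Return indices of bins that are the maximum of every window covering them."""
    pad = window - 1
    # Padding lets the edge bins be covered by `window` windows as well.
    padded = [0] * pad + list(hist) + [0] * pad
    votes = [0] * len(padded)
    for start in range(len(padded) - window + 1):
        chunk = padded[start:start + window]
        peak = max(chunk)
        if peak == 0:
            continue  # an empty window casts no vote
        votes[start + chunk.index(peak)] += 1
    return [i - pad for i, count in enumerate(votes) if count == window]
```

Applied to a thresholded histogram with two clusters of words, the function returns one bin index per cluster, which can then be converted back to a pixel position by multiplying by the bin width.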

Figure 1: Examples of Correctly Detected Columns

Figure 1 shows columns detected correctly with the method described above. Although there are no lines or marks indicating the table structure, all columns are detected correctly with small margins of error (less than one bin width). However, in tables such as those in Figure 2, where one column contains little information, the proposed method may fail to detect that column. The main cause is the 1/4 threshold adopted as a safeguard against page noise and formatting defects. Although removing this check would make these examples come out correctly, it can lead to worse results on other documents. The check was therefore kept as it is.

Figure 2: Examples of Incorrectly Detected Columns

B. Sorting the Words and Separating Them into Rows

After the column positions on the page have been detected, the words are sorted by their top pixel values. The words are then separated into rows using their top and bottom pixel values. To tolerate some page skew, this separation always compares the bottom pixel value of a word with the top pixel value of the word that follows it. When these values do not overlap, that is, when the bottom pixel value of a word is smaller than the top pixel value of the following word, the two words are taken to be on different rows and the following word is placed in a new row.
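The row separation rule can be sketched as follows. The `Word` tuple and the function name `split_into_rows` are hypothetical, and only the fields needed for this step are included.

```python
from collections import namedtuple

Word = namedtuple("Word", "text top bottom")


def split_into_rows(words):
    """Sort words by top edge, then start a new row whenever a word's top
    lies below the bottom edge of the word just before it."""
    ordered = sorted(words, key=lambda w: w.top)
    rows = []
    for word in ordered:
        if rows and rows[-1][-1].bottom >= word.top:
            rows[-1].append(word)  # vertical overlap: same row
        else:
            rows.append([word])    # no overlap: new row
    return rows
```

Because each word is compared only with its immediate predecessor, a slowly drifting baseline on a slightly skewed page still ends up in one row, which is the tolerance described above.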

C. Sorting the Words within Rows

In this step the words are sorted from left to right within the rows they belong to. However, to keep the order intact in cells that span more than one line, the comparison was arranged to take the top pixel values into account as well. As a result, for words stacked one above the other, the upper word comes first regardless of the left pixel values.
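The paper does not spell out how the top pixel values enter the comparison. One possible realization, shown purely as an assumption, treats two words as stacked when their horizontal extents overlap and orders such pairs by top edge, falling back to the left edge otherwise. The names `Word`, `compare_words`, and `sort_row` are hypothetical, and the comparator is not guaranteed to be consistent for arbitrary layouts.

```python
from collections import namedtuple
from functools import cmp_to_key

Word = namedtuple("Word", "text left right top")


def compare_words(a, b):
    """Order stacked words by top edge, all other pairs by left edge."""
    stacked = a.left < b.right and b.left < a.right  # horizontal overlap
    if stacked and a.top != b.top:
        return -1 if a.top < b.top else 1
    return (a.left > b.left) - (a.left < b.left)


def sort_row(words):
    """Return the words of one row in reading order."""
    return sorted(words, key=cmp_to_key(compare_words))
```

In the usage below, a plain left-to-right sort would place the lower word "Mah." before "Ataturk" because it starts 5 pixels further left; the comparator keeps the upper word first.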

D. Assigning Words to Columns

After sorting, each word is assigned, in sorted order, to the column nearest to it. During assignment the words are concatenated with a single space between them. The structure obtained from this step is the first version of the target table structure, and it performs adequately on very clean pages, for example on tables in which no cell holds more than one line of information. In the documents studied, however, a cell often holds a value spanning several lines, which produces rows in which only one column is filled. Figure 3 shows an example. Although the whole example holds a single row of information, the information in the 4th column occupies 3 lines, so after the steps above it ends up as extra rows. For such cases, the optional step E can be applied to merge the rows.

Figure 3: Example of an Incorrectly Split Row

E. Merging Rows

To resolve this situation, an additional row-merging step had to be added. The merger first scans the table from top to bottom and decides, based on the number of filled cells, whether a merge should take place. It then determines by proximity whether the sparsely filled columns that were detected should be merged into the row above or the row below. Based on the result, it performs the merge by concatenating the relevant cells in order, with a single space between them.

To decide whether rows should be merged, the contents of consecutive pairs of rows are examined. If the number of columns that contain information in both rows at once is 2 or fewer, the two rows are assumed to hold a single row of information. Taking the example in Figure 3, only the 4th column contains information in two consecutive rows; in the other columns, either the first or the second row is empty. These three rows are therefore merged in 2 steps. The threshold is set to 2 because of spurious characters that document noise can cause to be read. If cells with multi-line content occur in more than one column of the document, the threshold should be set to the number of such columns + 1.
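The merge decision for consecutive rows can be sketched as below. This is a simplified illustration: it always merges a row into the row above it and leaves out the proximity-based choice between the upper and lower row. The name `merge_rows` and the parameter `max_shared` are hypothetical, with `max_shared=2` mirroring the threshold in the text.

```python
def merge_rows(table, max_shared=2):
    """Merge a row into the previous one when at most `max_shared` columns
    are filled in both rows."""
    merged = []
    for row in table:
        if merged:
            previous = merged[-1]
            shared = sum(1 for a, b in zip(previous, row) if a and b)
            if shared <= max_shared:
                # Join cell by cell, skipping empty parts.
                merged[-1] = [
                    " ".join(part for part in (a, b) if part)
                    for a, b in zip(previous, row)
                ]
                continue
        merged.append(list(row))
    return merged
```

Note that with this rule a table of one or two columns would always collapse into a single row, so the threshold only makes sense for wider tables such as those in the test set.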

IV. RESULTS AND COMPARISON

This section presents the performance of the proposed system in comparison with commercially available products. To obtain ground truth for the test set of 58 documents, an annotator was employed. The annotator's task was to examine every document in the test set and record its numbers of rows and columns, thereby establishing the reference values.

The first commercial product is ABBYY FineReader 11 Release 8, whose output is also used by the proposed method. This product was already in use before this study, and the problems in its table reading performance played an important part in motivating the proposed method. Its most significant problems are that, because it reads the page in blocks, it splits a table into two or more parts, and that it sometimes detects no table at all.

In measuring this product's performance, the largest table structure it detected in the document was used. For example, if it detects tables of size 27x5 and 26x2 on a page, it is counted as having detected 27 rows and 5 columns.

The other commercial product is Readiris Pro 16. This product reads image files and can save them as docx, pdf, xlsx, and several other formats. Its biggest problems are that it produces empty columns and rows when building the table structure and that it does not reproduce formatting properties well enough. In addition, it splits the information too finely and therefore fails to detect cells that span more than one line.

For the performance measurements, the documents read were saved in xlsx format; these files were then opened and the numbers of detected rows and columns were recorded. Empty rows were ignored at this stage and excluded from the calculation.

Figure 4 presents, for each product and for the proposed method, the number of documents whose row and column counts were detected incorrectly. To compute these numbers, the row and column counts detected by each product were compared with the counts previously determined by the annotator. Whenever the counts were not equal, the number of erroneous documents for that product was increased by 1. Results were derived and visualized on both a row basis and a column basis.
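The per-document error count described here amounts to an exact-match comparison against the annotator's values. A minimal sketch follows; the function name and the sample numbers in the usage are illustrative and are not taken from the test set.

```python
def count_wrong_documents(detected, reference):
    """Count documents whose detected row or column count differs from the reference.

    detected, reference: lists of (rows, columns), one pair per document,
    in the same document order.
    """
    if len(detected) != len(reference):
        raise ValueError("both lists must cover the same documents")
    wrong_rows = sum(1 for d, r in zip(detected, reference) if d[0] != r[0])
    wrong_cols = sum(1 for d, r in zip(detected, reference) if d[1] != r[1])
    return wrong_rows, wrong_cols
```

A document contributes at most 1 to each count regardless of how far off the detected number is, which is why this measure is complemented by the cumulative totals in Table I.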

Figure 4: Numbers of Documents with Incorrectly Detected Row and Column Counts

In the row-based results, Readiris performs worst (52/58 incorrect detections), while the proposed method performs best (4/58 incorrect detections). In the column-based results, the ABBYY product performs worst (22/58 incorrect detections), because in some documents it detects more than one table and these tables are split vertically. The proposed method again performs best in this measure (6/58 incorrect detections). In addition, the ABBYY product detected no table at all in 3 documents.

In Table I, the numbers of rows and columns detected over the test set by each product and by the proposed method are summed cumulatively and compared with the reference values. These results are again divided into row-based and column-based.

Table I: Total Row and Column Detections

                      Rows                          Columns
                  Count   Diff.   Diff. (%)     Count   Diff.   Diff. (%)
Reference         1686    -       -             418     -       -
ABBYY             1554    -132    -7.83         365     -53     -12.68
Readiris          2871    +1185   +70.28        423     +5      +1.20
Proposed Method   1677    -9      -0.53         411     -7      -1.67
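The difference columns of Table I follow directly from the totals: the difference is the detected count minus the reference count, and the percentage is that difference relative to the reference. The short sketch below recomputes them; `difference` is a hypothetical helper name.

```python
def difference(count, reference):
    """Return (absolute difference, difference as a percentage of the reference)."""
    diff = count - reference
    return diff, round(100 * diff / reference, 2)


# Row totals from Table I, compared with the reference total of 1686 rows.
for name, rows in [("ABBYY", 1554), ("Readiris", 2871), ("Proposed Method", 1677)]:
    print(name, difference(rows, 1686))
```

Negative values mean that fewer rows or columns were detected than the annotator counted, positive values mean that more were detected.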

In the row-based results, Readiris gives a rather poor result because it does not merge rows, detecting about 70% more rows than the reference. The ABBYY product falls short of the reference values by 7.83%. The proposed method, by contrast, comes out ahead with a difference of 0.53%, which corresponds to only 9 missed rows over the entire test set.

In the column-based results, the ABBYY product gives the worst result (12.68% under-detection) because of the problems mentioned earlier. Readiris, although it trails the proposed method on a per-document basis, achieves the best result in this measure with 1.20% over-detection. The proposed method follows closely with 1.67% under-detection.

In conclusion, when the row-based and column-based comparisons are considered together, the proposed method has a clear advantage over these two commercial products. One important reason is that the proposed method is goal-oriented and tries to convert the entire OCR text into one table, which is not the case for the other products.

V. CONCLUSION

This paper proposed a method for extracting information from printed samples containing tabular data while preserving its layout. The Introduction explained the general aim of the work and outlined the paper. The Related Work section then covered the state of the art. Section III, Method, described the work in detail. In the Results and Comparison section, performance was measured comparatively using this method and existing commercial products.

The current project is an initial study and leaves considerable room for improvement. For example, to tolerate scans in which the page content is skewed, rows and columns need to be detected as slanted rather than strictly horizontal and vertical. It is also possible to reproduce the formatting properties of the page by using more advanced techniques. This work is planned for the future.

ACKNOWLEDGMENT

This work was supported by TÜBİTAK TEYDEB under project no. 3160184.

REFERENCES

[1] Islam, N., Islam, Z., & Noor, N. (2016). A Survey on Optical Character Recognition System. Journal of Information & Communication Technology-JICT Vol. 10 Issue. 2.

[2] Baran, R., Partila, P., & Wilk, R. (2018, January). Automated Text Detection and Character Recognition in Natural Scenes Based on Local Image Features and Contour Processing Techniques. In International Conference on Intelligent Human Systems Integration (pp. 42-48). Springer, Cham.

[3] Shabana, M. A., Jose, A., & Sunny, A. (2018). Text Detection and Recognition in Natural Images.

[4] Kumar, P., Saini, R., Roy, P. P., & Pal, U. (2018). A lexicon-free approach for 3D handwriting recognition using classifier combination. Pattern Recognition Letters, 103, 1-7.

[5] Samanta, O., Roy, A., Parui, S. K., & Bhattacharya, U. (2018). An HMM Framework based on Spherical-Linear Features for Online Cursive Handwriting Recognition. Information Sciences.

[6] Sueiras, J., Ruiz, V., Sanchez, A., & Velez, J. F. (2018). Offline Continuous Handwriting Recognition Using Sequence to Sequence Neural Networks. Neurocomputing.

[7] Chauhan, R., & Pipalia, D. (2018). Smart Electronic Real Time Text Recognition Application. Journal of Electronic Design Technology, 8(3), 1-7.

[8] Liu, Z., Li, Y., Ren, F., Yu, H., & Goh, W. (2018). SqueezedText: A Real-time Scene Text Recognition by Binary Convolutional Encoder-decoder Network.

[9] Zanibbi, R., Blostein, D., & Cordy, J. R. (2004). A survey of table recognition. Document Analysis and Recognition, 7(1), 1-16.

[10] Yildiz, B., Kaiser, K., & Miksch, S. (2005, December). pdf2table: A method to extract table information from pdf files. In IICAI (pp. 1773-1785).

[11] Coüasnon, B., & Lemaitre, A. (2014). Recognition of Tables and Forms. Handbook of Document Image Processing and Recognition, 2014.

[12] Parikh, R., & Vasant, A. (2013). Table of Content Detection using Machine Learning: Proposed System. International Journal of Artificial Intelligence & Applications, 4(3), 13.

[13] Bansal, A., Harit, G., & Roy, S. D. (2014, December). Table Extraction from Document Images using Fixed Point Model. In Proceedings of the 2014 Indian Conference on Computer Vision Graphics and Image Processing (p. 67). ACM.

[14] Vasileiadis, M., Kaklanis, N., Votis, K., & Tzovaras, D. (2017, April). Extraction of Tabular Data from Document Images. In Proceedings of the 14th Web for All Conference on The Future of Accessible Work (p. 24). ACM.

Information Extraction from Text Intensive and Visually Rich Banking Documents

Article

Sep 2020

Document types, where visual and textual information plays an important role in their analysis and understanding, pose a new and attractive area for information extraction research. Although cheques, invoices, and receipts have been studied in some previous multi-modal studies, banking documents present an unexplored area due to the naturalness of the text they possess in addition to their visual richness. This article presents the first study which uses visual and textual information for deep-learning based information extraction on text-intensive and visually rich scanned documents which are, in this instance, unstructured banking documents, or more precisely, money transfer orders. The impact of using different neural word representations (i.e., FastText, ELMo, and BERT) on IE subtasks (namely, named entity recognition and relation extraction stages), positional features of words on document images and auxiliary learning with some other tasks are investigated. The article proposes a new relation extraction algorithm based on graph factorization to solve the complex relation extraction problem where the relations within documents are n-ary, nested, document-level, and previously indeterminate in quantity. Our experiments revealed that the use of deep learning algorithms yielded around 10 percentage points improvement on the IE sub-tasks. The inclusion of word positional features yielded around 3 percentage points of improvement in some specific information fields. Similarly, our auxiliary learning experiments yielded around 2 percentage points of improvement on some information fields associated with the specific transaction type detected by our auxiliary task. The integration of the information extraction system into a real banking environment reduced cycle times substantially. When compared to the manual workflow, document processing pipeline shortened book-to-book money transfers to 10 minutes (from 29 min.) 
and electronic fund transfers (EFT) to 17 minutes (from 41 min.) respectively.

FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

Preprint

May 2019

In this paper, we present a new dataset for Form Understanding in Noisy Scanned Documents (FUNSD). Form Understanding (FoUn) aims at extracting and structuring the textual content of forms. The dataset comprises 200 fully annotated real scanned forms. The documents are noisy and exhibit large variabilities in their representation making FoUn a challenging task. The proposed dataset can be used for various tasks including text detection, optical character recognition (OCR), spatial layout analysis and entity labeling/linking. To the best of our knowledge this is the first publicly available dataset with comprehensive annotations addressing the FoUn task. We also present a set of baselines and introduce metrics to evaluate performance on the FUNSD dataset. The FUNSD dataset can be downloaded at https://guillaumejaume.github. io/FUNSD/.

An approach towards the development of refreshable Braille Computer Display Unit

Conference Paper

Mar 2019

Modern time is the age of digitalization. People are accessing information with the advancement of digital technology. Blinds have very limited accessible resources to access this digitalized information. In this article, a novel proposal about working and development of an electronic refreshable braille display unit to build a getaway to the digital world for the blind people. An economic adoption of a tactile display having an array of six small independent vibrator motors are arranged in an array of a 3×2 matrix. The system would be enabled to initially capture information from computer screen then with suitable image processing, characters recognition and stimulation of target motor/s would enable enhanced readability of the texts.

FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

Conference Paper

Sep 2019

OCR-based Solution for The Integration of Legacy And-Or Non-Electric Counters in Cloud Smart Grids

Conference Paper

Oct 2018

Optical Character Recognition (OCR) can provide tele-measurement for legacy utility meters (water, natural gas, electricity etc) and can be accepted as a low-cost extension for the capabilities of these counters and for their time of life. On the other side, the “galvanic separation” provided by optical acquisition of displayed counter values offers the confidence of secure reading - for natural gas meters and similar instruments. Our cost-effective solution enables simple snapshots of counters displays (with simple devices like an “ArdtrCarn” - camera for Arduino embedded solutions), local “edge computing” for image packing (in. BMP format) and tele-transmission via the WiFi hot-spot of the intelligent building. At the server-side, National Instruments software sub-systems - NI Vision Development Module and LabView- provide OCR and cloud aggregation of measured values. Our solution is related to a research project for cloud-based smart metering (the LabVIEW sub-system belongs to the communications work-package). Specific to our contribution is the re-balancing of local and centralized processing, involving an assisted learning phase for the OCR.

A Survey on Optical Character Recognition System

Article

Full-text available

Dec 2016

Optical Character Recognition (OCR) has been a topic of interest for many years. It is defined as the process of digitizing a document image into its constituent characters. Despite decades of intense research, developing OCR with capabilities comparable to that of human still remains an open challenge. Due to this challenging nature, researchers from industry and academic circles have directed their attentions towards Optical Character Recognition. Over the last few years, the number of academic laboratories and companies involved in research on Character Recognition has increased dramatically. This research aims at summarizing the research so far done in the field of OCR. It provides an overview of different aspects of OCR and discusses corresponding proposals aimed at resolving issues of OCR.

Recognition of Tables and Forms

Article

Full-text available

Jan 2014

Tables and forms are a very common way to organize information in structured documents. Their recognition is fundamental for the recognition of the documents. Indeed, the physical organization of a table or a form gives a lot of information concerning the logical meaning of the content. This chapter presents the different tasks that are related to the recognition of tables and forms and the associated well-known methods and remaining challenges. Three main tasks are pointed out: the detection of tables in heterogeneous documents; the classification of tables and forms, according to predefined models; and the recognition of table and form contents. The complexity of these three tasks is related to the kind of studied document: image-based document or digital-born documents. At last, this chapter will introduce some existing systems for table and form analysis.

Table of Content detection using Machine Learning

Article

Full-text available

Jun 2013

Table of content (TOC) detection has drawn attention now a day because it plays an important role in digitization of multipage document. Generally book document is multipage document. So it becomes necessary to detect Table of Content page for easy navigation of multipage document and also to make information retrieval faster for desirable data from the multipage document. All the Table of content pages follow the different layout, different way of presenting the contents of the document like chapter, section, subsection etc. This paper introduces a new method to detect Table of content using machine learning technique with different features. With the main aim to detect Table of Content pages is to structure the document according to their contents.

pdf2table: A Method to Extract Table Information from PDF Files.

Conference Paper

Full-text available

Jan 2005

Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to b e develop, which capture the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) for easier reuse. Addition- ally, we implemented a prototype, which gives the user the ability of making adjustments on the extracted data. Our work shows that purely heuristic-based approaches can achieve good results, especially for lucid t ables.

SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network

Article

Apr 2018

A new approach for real-time scene text recognition is proposed in this paper. A novel binary convolutional encoder-decoder network (B-CEDNet) together with a bidirectional recurrent neural network (Bi-RNN). The B-CEDNet is engaged as a visual front-end to provide elaborated character detection, and a back-end Bi-RNN performs character-level sequential correction and classification based on learned contextual knowledge. The front-end B-CEDNet can process multiple regions containing characters using a one-off forward operation, and is trained under binary constraints with significant compression. Hence it leads to both remarkable inference run-time speedup as well as memory usage reduction. With the elaborated character detection, the back-end Bi-RNN merely processes a low dimension feature sequence with category and spatial information of extracted characters for sequence correction and classification. By training with over 1,000,000 synthetic scene text images, the B-CEDNet achieves a recall rate of 0.86, precision of 0.88 and F-score of 0.87 on ICDAR-03 and ICDAR-13. With the correction and classification by Bi-RNN, the proposed real-time scene text recognition achieves state-of-the-art accuracy while only consumes less than 1-ms inference run-time. The flow processing flow is realized on GPU with a small network size of 1.01 MB for B-CEDNet and 3.23 MB for Bi-RNN, which is much faster and smaller than the existing solutions.

An HMM Framework based on Spherical-Linear Features for Online Cursive Handwriting Recognition

Article

Feb 2018
INFORM SCIENCES

In this paper a Hidden Markov Model (HMM) based writer independent online unconstrained handwritten word recognition scheme is proposed. The main steps here are segmentation of handwritten word samples into sub-strokes, feature extraction from the sub-strokes and recognition. We propose a novel but simple strategy based on the well-known discrete curve evolution for the segmentation task. Next, certain angular and linear features are extracted from the sub-strokes of word samples and are modelled as feature vectors generated from a mixture distribution. This mixture model is designed to accommodate the correlation among the angular variables. We formulate a Baum-Welch parameter estimation algorithm that can handle spherical-linear correlated data to construct an HMM. Finally, based on this HMM, we design a classifier for recognition of handwritten word samples. Simulation trials have been conducted on handwritten word sample databases of Latin and Bangla scripts demonstrating successful performance of the proposed recognition scheme.

Offline Continuous Handwriting Recognition Using Sequence to Sequence Neural Networks

Article

Feb 2018
NEUROCOMPUTING

This paper proposes a new neural network architecture that combines a deep convolutional neural network with an encoder-decoder, known as sequence to sequence, to solve the problem of recognizing isolated handwritten words. The proposed architecture aims to identify the characters and contextualize them with their neighbors in order to recognize any given word. Our model proposes a novel way to extract relevant visual features from a word image: it uses a horizontal sliding window to extract image patches and applies the LeNet-5 convolutional architecture to identify the characters. The extracted features are modeled with a sequence-to-sequence architecture that encodes the visual characteristics and then decodes the sequence of characters in the handwritten text image. We test the proposed model on two handwriting databases (IAM and RIMES) in several experiments to determine the optimal parameterization of the model. The results are competitive with, and improve on, those reported for current state-of-the-art handwriting models. Without using any language model and with a closed dictionary, we obtain a word error rate on the test set of 12.7% on IAM and 6.6% on RIMES.
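The horizontal sliding window described above can be illustrated with a small sketch. It only shows how a word image might be cut into overlapping patches; the window width, the stride and the function name `sliding_patches` are assumptions made for illustration, since the abstract does not give the paper's actual settings, and the convolutional feature extractor is left out.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values


def sliding_patches(image: Image, width: int, stride: int) -> List[Image]:
    """Cut full-height patches of `width` columns, moving `stride` columns per step."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    n_cols = len(image[0]) if image else 0
    patches = []
    # The last window starts at n_cols - width so no patch runs off the image.
    for left in range(0, n_cols - width + 1, stride):
        patches.append([row[left:left + width] for row in image])
    return patches


# A toy 2x6 "image"; real inputs would be grayscale word images.
image = [[0, 1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10, 11]]
for patch in sliding_patches(image, width=4, stride=2):
    print(patch)
```

Each patch would then be passed to the convolutional network, and the resulting sequence of feature vectors fed to the encoder.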

A lexicon-free approach for 3D handwriting recognition using classifier combination

Article

Dec 2017
PATTERN RECOGN LETT

Recent developments in depth sensing technology such as Leap Motion have opened novel directions in the Human-Computer Interaction (HCI) research domain. The sensor extends writing from the traditional method to gesture-based writing in 3D space. Online text written in 3D space over the sensor's viewing field differs from traditional 2D handwriting in several ways. 3D handwriting does not contain any stroke information, since all characters are connected by a single stroke. Moreover, non-uniform text styles and jitter during writing in 3D space create additional challenges for the recognition task. Because of these challenges, a single classifier does not recognize cursive 3D words satisfactorily. In this paper, we present a lexicon-free approach for the recognition of 3D handwritten words in Latin and Devanagari scripts by combining multiple classifiers. The individual recognition systems are built using a Bidirectional Long Short-Term Memory Neural Network (BLSTM-NN) classifier with different features. The multiple classifiers are combined by aligning the output word sequence of each classifier using the Recognizer Output Voting Error Reduction (ROVER) framework. Accuracies of 72.25% and 71.86% are recorded using the proposed methodology for Latin and Devanagari scripts, respectively.

Extraction of Tabular Data from Document Images

Conference Paper

Apr 2017

In this paper, we propose a heuristics-based method for automatic detection and extraction of tabular data from document images. The proposed approach utilizes page segmentation techniques, along with an OCR engine, in order to acquire the text data and bounding boxes of each word in the document. These elements are then grouped in a bottom-up fashion, based on a series of rules, in order to identify and reconstruct tabular arrangements of data. Based on this methodology, an open source cross-platform tool capable of recognizing the semantic structure of documents containing tabular data has been implemented, thus widening the range of document types that can be successfully converted into alternative accessible formats, suitable for users with visual impairments.
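The bottom-up grouping of OCR word boxes described above can be sketched as follows. This is not the rule set used in that paper, which the abstract does not spell out; it is a deliberately simple illustration that assumes left-aligned, single-word cells. The `Word` type, the tolerances `y_tol` and `x_tol`, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def group_rows(words: List[Word], y_tol: float) -> List[List[Word]]:
    """Merge words whose vertical centers lie within y_tol of the row's mean."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        cy = (w.y0 + w.y1) / 2
        if rows:
            last = rows[-1]
            last_cy = sum((v.y0 + v.y1) / 2 for v in last) / len(last)
            if abs(cy - last_cy) <= y_tol:
                last.append(w)
                continue
        rows.append([w])
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def column_starts(words: List[Word], x_tol: float) -> List[float]:
    """Cluster left edges; a new column begins when the gap exceeds x_tol."""
    starts: List[float] = []
    for x in sorted(w.x0 for w in words):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_grid(words: List[Word], y_tol: float = 5.0,
            x_tol: float = 10.0) -> List[List[str]]:
    """Place each word in the column whose start is nearest to its left edge."""
    starts = column_starts(words, x_tol)
    grid = []
    for row in group_rows(words, y_tol):
        cells = [""] * len(starts)
        for w in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - w.x0))
            cells[col] = (cells[col] + " " + w.text).strip()
        grid.append(cells)
    return grid


words = [Word("Item", 10, 10, 50, 22), Word("Qty", 120, 11, 150, 23),
         Word("Pen", 11, 40, 45, 52), Word("3", 122, 41, 130, 53)]
print(to_grid(words))  # [['Item', 'Qty'], ['Pen', '3']]
```

Cells containing several words, right-aligned numbers and spanning headers would each need additional rules, which is where a practical system's heuristics come in.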

Table Extraction from Document Images using Fixed Point Model

Conference Paper

Dec 2014

The paper presents a novel learning-based framework to identify tables from scanned document images. The approach is designed as a structured labeling problem, which learns the layout of the document and labels its various entities as table header, table trailer, table cell and non-table region. We develop features which encode the foreground block characteristics and the contextual information. These features are provided to a fixed point model which learns the inter-relationship between the blocks. The fixed point model attains a contraction mapping and provides a unique label to each block. We compare the results with Conditional Random Fields (CRFs). Unlike CRFs, the fixed point model captures the context information in terms of the neighbourhood layout more efficiently. Experiments on images picked from the UW-III (University of Washington) dataset, the UNLV dataset and our own dataset of document images with multicolumn page layouts show the applicability of our algorithm to layout analysis and table detection.
