Latest trends in hybrid machine translation and its applications

Abstract

This survey on hybrid machine translation (MT) is motivated by the fact that hybridization techniques have become popular as they attempt to combine the best characteristics of highly advanced pure rule or corpus-based MT approaches. Existing research typically covers either simple or more complex architectures guided by either rule or corpus-based approaches. The goal is to combine the best properties of each type. This survey provides a detailed overview of the modification of the standard rule-based architecture to include statistical knowledge, the introduction of rules in corpus-based approaches, and the hybridization of approaches within this last single category. The principal aim here is to cover the leading research and progress in this field of MT and in several related applications.
Available online at www.sciencedirect.com
ScienceDirect
Computer Speech and Language 32 (2015) 3–10
Marta R. Costa-jussà a, José A.R. Fonollosa b
a Institute for Infocomm Research, 1 Fusionopolis Way, Singapore 138632, Singapore
b Universitat Politècnica de Catalunya, Jordi Girona, Barcelona 08034, Spain
Received 31 October 2014; accepted 1 November 2014
Available online 15 November 2014
© 2014 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).
MSC: 00-01; 99-00
Keywords: Hybridization; Machine translation; Corpus; Rules; Applications

1. Introduction

Machine translation (MT) is the area of natural language processing (NLP) that focuses on obtaining a target language text from a source language text by means of automatic techniques. MT is a multidisciplinary field and the challenge has been approached from various points of view including linguistics and statistics. The existence of different perspectives has made possible the proliferation of hybrid methodologies. Hybrid methods focus on combining the best properties of two or more MT approaches. Nowadays, it has become very popular to include rules in statistical MT (SMT) approaches. However, there are also relevant works on enhancing standard rule-based MT (RBMT) by adding statistical knowledge. Recent initiatives such as the three editions of the HyTra workshop¹ show that linguists, engineers and computer scientists actively interact in the interests of building successful hybrid architectures, formulating proposals and conducting experiments.

This survey paper reviews recent methods that combine and hybridize MT approaches in single architectures, and thus, two closely related lines of research fall outside our scope. First, the methodologies of multi-engine combination, which have been widely studied in MT,² as well as in other related areas (e.g. speech recognition). These approaches assemble MT outputs, not MT architectures. And second, the integration of linguistic knowledge into SMT when studies do not consider different MT paradigms. For a survey on this specific topic see Costa-jussà and Farrús (2014).

The rest of the paper is organized as follows. Section 2 explains two classifications of MT approaches. Section 3 reports the main hybridization methods within and across paradigms. Section 4 describes several MT applications with hybrid components. Finally, Section 5 summarizes the main findings of this survey.

This paper has been recommended for acceptance by Roger K. Moore.
Corresponding author. Current address: Instituto Politécnico Nacional, Mexico. Tel.: +51 1 5525298370.
¹ http://parles.upf.edu/llocs/plambert/hytra/hytra2014/.
http://dx.doi.org/10.1016/j.csl.2014.11.001
0885-2308/© 2014 The Authors. Published by Elsevier Ltd.

2. Classification of machine translation

Basically, MT approaches can be classified into different paradigms using two criteria: either the level of representation or the sources of information.

2.1. Level of representation

When classifying MT by level of representation, we can think of the Vauquois pyramid that basically contains: direct, transfer and interlingua approaches.

Direct. Approaches at the bottom of the Vauquois pyramid require one single step transformation between source and target, without analysis of the source language and without generation of the target language. Within this category, we might find simple dictionary-based translations.

Transfer. Approaches in the middle of the Vauquois pyramid consist of three steps: analysis, transfer and generation. This category includes RBMT, EBMT and SMT approaches.

Interlingua. Approaches at the top of the Vauquois pyramid consist of two steps: analysis and generation. The analysis transforms the source language into the interlingua representation and the generation transforms this interlingua representation into the target language. Interlingua is a universal representation of all languages, needing no transfer stage.

Wu (2005) offers observations as to whether a system can be considered direct or transfer depending largely on how much or how little language-specific monolingual analysis is carried out and also how close the intermediate representations are to the source and target texts themselves. Essentially, most of the approaches (other than interlingua) mentioned in this article could be classified as transfer-based engines, with varying degrees of complexity in their transfer, analysis and generation stages.
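
The difference between the direct and transfer levels can be pictured with a toy example. The following is a minimal sketch, assuming a three-word English-Spanish lexicon, hand-assigned part-of-speech tags and a single reordering rule, all of which are invented here for illustration and are not taken from any system cited in this survey.

```python
# Toy illustration of two levels of the Vauquois pyramid.
# The lexicon, the tags and the single reordering rule are assumptions made for this sketch.

LEXICON = {"the": "la", "white": "blanca", "house": "casa"}
POS = {"the": "DET", "white": "ADJ", "house": "NOUN"}


def direct_translate(sentence):
    """Direct approach: one word-for-word dictionary step, no analysis or generation."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())


def analyse(sentence):
    """Analysis: turn the source text into (word, part-of-speech) pairs."""
    return [(w, POS.get(w, "UNK")) for w in sentence.lower().split()]


def transfer(tagged):
    """Transfer: map words to the target language and apply one structural rule
    (English ADJ NOUN becomes Spanish NOUN ADJ)."""
    out = [(LEXICON.get(w, w), tag) for w, tag in tagged]
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out


def generate(tagged):
    """Generation: produce the target surface string."""
    return " ".join(w for w, _ in tagged)


def transfer_translate(sentence):
    """Transfer approach: analysis, then transfer, then generation."""
    return generate(transfer(analyse(sentence)))
```

With this toy data, the direct function keeps the source word order, while the three-step function has a place (the transfer step) where structural knowledge about the language pair can be applied.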

2.2. Sources of information

MT sources of information can be rules or data. The former is linguistically motivated, and the latter is more statistically motivated.

Rules. MT approaches based on rules (i.e. RBMT) use linguistic information such as monolingual and bilingual dictionaries combined with human linguistic knowledge. Rules are developed manually to transfer a source language text into a target language text. Most popular RBMT approaches apply three different phases: analysis, transfer and generation.

Data. Data-driven MT approaches use information from data and complex algorithms which together are capable of modeling translation. Data-driven MT includes example-based (EBMT) and statistical (SMT) approaches. By definition, EBMT approaches perform a direct translation by analogy, which can be seen as a pattern matching problem. Unlike these, SMT systems try to find the most probable translation given the source sentence, by reference to models built from data, such as the translation model and the language model (Brown et al., 1993). SMT can be classified into phrase-based, syntax-based and hierarchical models. The main difference among these models is the structure of the bilingual units, which can be built from: (1) plain text in the case of phrase models; (2) more complex data including grammars and dependency trees in syntax models; and (3) plain text but allowing hierarchical units in hierarchical systems.
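
The "most probable translation" criterion mentioned above is usually written as the noisy-channel decision rule of Brown et al. (1993). The notation below (f for the source sentence, e for a candidate target sentence) follows that paper rather than this survey, which introduces no symbols of its own.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \, \underbrace{P(e)}_{\text{language model}}
```

The second equality follows from Bayes' rule, since P(f) does not depend on e and can be dropped from the maximization.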

Given that hybridization is the focus of this study, we will consider this latter criterion (sources of information) in order to distinguish MT paradigms. Within this category, we detail a wide variety of hybridization approaches.

² See references in http://www.statmt.org/survey/Topic/SystemCombination.

Fig. 1. Classification of hybrid MT architectures.

Fig. 2. Schema of hybridization guided by RBMT.

3. Hybridization of machine translation architectures

Several different methodologies have been used to hybridize MT within and across paradigms. As shown in Fig. 1, hybridization of RBMT and corpus-based MT can be classified into those guided by RBMT or guided by corpus-based MT. The former integrates data information into a rule-based architecture; the latter integrates linguistic rules into a corpus-based architecture.

3.1. Hybridization guided by RBMT

There are several kinds of strategy within this category: introducing a corpus to build the RBMT system, introducing corpus-based tools to weight the RBMT output and carrying out a statistical post-editing of a RBMT output.

Using a corpus to build the RBMT system. The main reason for using data when building a RBMT system is to reduce its cost and the time and effort required. A quite straightforward approach is to enhance dictionaries with phrases (Habash et al., 2009) or examples (Sánchez-Martínez et al., 2009; Antonova and Misyurev, 2014) extracted from parallel corpora, or to extract new entries from BabelNet and Wiktionary (Göhring, 2014). More complex approaches extract transfer rules (Sánchez-Martínez and Forcada, 2009), build lexical selection modules using parallel corpora with finite-state transducers (Tyers et al., 2012) or Maximum Entropy Markov Models (Rudnick and Gasser, 2013), and combine several of these techniques (Costa-jussà and Centelles, 2015).

Corpus-based tools for weighting the RBMT output. There is work that focuses on improving the RBMT output by integrating tools such as language models (Dove et al., 2012) or stochastic parsers (Federmann and Hunsicker, 2011). Labaka et al. (2014) present a hybrid translation system guided by the RBMT engine in which, before transfer, a set of partial candidate translations provided by SMT subsystems is used to enrich the tree-based representation. The final hybrid translation is created by choosing the most probable combination among the available fragments with a statistical decoder in a monotonic way (see Fig. 2). In addition, there are RBMT systems that introduce machine learning techniques such as classifiers in order to identify the set of appropriate translation candidates (Hunsicker et al., 2012). Recent experiments by Systran build a statistical inference module to replace the RBMT transfer module (Crego, 2014) and experiments by Lingenio show that RBMT systems can learn morphological classification, semantic and syntactic information from corpus data (Eberle, 2014).
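
One way to picture the use of a language model to weight RBMT output is to let it choose among alternative outputs that differ in a lexical choice. The sketch below is only an illustration under stated assumptions: the add-one smoothed bigram model, the tiny corpus and the two candidate sentences are invented here, and it does not reproduce the setup of any cited system.

```python
import math
from collections import Counter


def train_bigram_lm(corpus):
    """Collect unigram and bigram counts from tokenised target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )


def pick_best(candidates, unigrams, bigrams):
    """Choose the candidate that the language model scores as most fluent."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Tiny monolingual target corpus (made up for the example).
corpus = [
    "the bank approved the loan",
    "the bank closed early",
    "she sat on the river bank",
]
unigrams, bigrams = train_bigram_lm(corpus)

# Two hypothetical RBMT outputs that differ in one lexical choice.
candidates = ["the bank approved the loan", "the bench approved the loan"]
best = pick_best(candidates, unigrams, bigrams)
```

The rule-based engine stays in charge of producing the candidates; the corpus-derived model only ranks them, which is the sense in which the RBMT system guides the hybrid.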

Statistical post-editing of RBMT outputs. There are studies that carry out statistical post-editing for RBMT systems (Simard et al., 2007; Lagarda et al., 2009) and it is even a commercial reality,³ as pointed out in Béchara et al. (2012). Generally speaking, these approaches consider RBMT outputs as source sentences and post-edited results as target sentences. In other cases (Suzuki, 2011), confidence estimation measures are used instead of manually post-edited results. The statistical module tends to be implemented with Moses (Koehn et al., 2007). In this case, RBMT and SMT paradigms are concatenated but not integrated at the architecture level.

³ http://www.systran.co.uk/translation-products/server/systran-enterprise-server.

Fig. 3. Schema of hybridization guided by corpus-based MT.
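
The data flow of statistical post-editing, with RBMT output on the source side and post-edited text on the target side, can be sketched as follows. This is a deliberately simplified stand-in: it learns context-free word substitutions from equal-length sentence pairs instead of training a phrase-based system such as Moses, and the training pairs are invented.

```python
from collections import Counter, defaultdict


def learn_corrections(pairs):
    """Count word substitutions between RBMT outputs ("source") and their
    post-edited versions ("target"). Only equal-length pairs are used, so
    position i on one side is taken to align with position i on the other."""
    counts = defaultdict(Counter)
    for mt_output, post_edited in pairs:
        src, tgt = mt_output.split(), post_edited.split()
        if len(src) != len(tgt):
            continue  # a real system would learn a proper word alignment here
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    # Keep, for every RBMT word, its most frequent post-edited form.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def post_edit(mt_output, corrections):
    """Apply the learned substitutions to a new RBMT output."""
    return " ".join(corrections.get(w, w) for w in mt_output.split())


# Invented training data: (RBMT output, human post-edited sentence).
training_pairs = [
    ("he made a photo", "he took a photo"),
    ("she made a photo yesterday", "she took a photo yesterday"),
    ("they made a cake", "they made a cake"),
]
corrections = learn_corrections(training_pairs)
result = post_edit("we made a photo", corrections)
```

Because the substitutions ignore context, this toy model would also rewrite "made a cake"; phrase-based post-editing addresses that by learning multi-word units, which is one reason the cited studies rely on a full SMT toolkit.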

3.2. Hybridization guided by corpus-based MT

A hybrid system guided by corpus-based MT may incorporate rules or just combine various corpus-based MT approaches. There are basically two main ways of integrating rules into corpus-based MT approaches: using rules at pre/post-processing, and integrating dictionaries/rules into the core model.

Rules at pre/post-processing. Pre-processing rules have been used to reorder the source sentence into a form that better matches the target language (Xia and McCord, 2004; Collins et al., 2005; Wang et al., 2007; Patel et al., 2013). The schema for this type of strategy is shown in Fig. 3. Post-processing rules for morphology generation have been introduced by means of a combination of machine learning and the introduction of dictionaries (Formiga et al., 2012). Finally, a set of both pre-processing and post-processing rules has been compiled ad hoc for the Spanish-Catalan translation pair in Farrús et al. (2011), in order to solve the normalization problems typically found in noisy corpora.
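
A pre-processing reordering rule can be as small as the sketch below, which moves the verb of a subject-verb-object sentence to the end so that the source resembles a verb-final target language. The rule, the hand-written tags and the function name are assumptions made for illustration; the cited papers use richer, parser-based rule sets.

```python
def preorder_svo_to_sov(tagged_source):
    """Move the first verb to the end of the sentence (toy SVO -> SOV rule).

    `tagged_source` is a list of (word, part-of-speech) pairs.
    """
    words = list(tagged_source)
    for i, (_, tag) in enumerate(words):
        if tag == "VERB":
            words.append(words.pop(i))
            break  # stop right after modifying the list
    return [word for word, _ in words]


# Tags are given by hand here; a real pipeline would use a tagger or parser.
source = [("Ram", "NOUN"), ("eats", "VERB"), ("an", "DET"), ("apple", "NOUN")]
reordered = preorder_svo_to_sov(source)
```

The reordered token sequence would then be passed to the corpus-based system, both when training and when translating, so that the statistical model has less long-distance reordering to learn.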

Incorporating dictionaries/rules into the core model. Rules may be integrated into the core model of corpus-based MT approaches. Early work such as Carl et al. (2000) integrates morphology and syntax knowledge from the RBMT system dynamically into an EBMT system. In other cases, RBMT systems have been integrated into the phrase-based SMT modules. For example, Hua and Haifeng (2004) use RBMT information to improve statistical word alignment. Then, Eisele et al. (2008) augment the standard phrase table with entries obtained after translating the data with several RBMT systems. The resulting phrase table thus combines statistically gathered phrase pairs with phrase pairs generated by linguistic rules. Similarly, Sánchez-Cartagena et al. (2011) enrich the phrase table with bilingual phrase pairs matching transfer rules and dictionary entries from a shallow-transfer RBMT system, and carry out a comparison with an earlier paper (Eisele et al., 2008). Further work by these latter authors (Chen and Eisele, 2010) integrates a commercial RBMT system with a hierarchical SMT system by extracting rules from RBMT translations. The hybrid system inherits the lexicons from both sub-systems as well as local syntactic constructions defined in RBMT. From a different perspective, Ahsan et al. (2010) focus on integrating local and long reorderings as well as the generation module from an RBMT system, into the core translation model of a standard statistical system. Furthermore, Enache et al. (2012) introduce rules from a grammar formalism into the phrase table, and Okuma et al. (2008) introduce dictionaries into the phrase table to reduce the number of unknown words.
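
The idea of a phrase table that mixes statistically gathered pairs with rule-generated pairs can be sketched as a simple merge. The data structure, the `from_rbmt` indicator feature and the default score below are assumptions made for this illustration; they are not the format or the feature set used by Eisele et al. (2008) or by Moses.

```python
def merge_phrase_tables(smt_table, rbmt_pairs, rbmt_score=1.0):
    """Return a phrase table that holds the corpus-derived entries plus the
    RBMT-generated pairs. Each entry maps (source, target) to a feature dict;
    the binary 'from_rbmt' feature lets a decoder weight the two origins."""
    merged = {
        pair: {"prob": prob, "from_rbmt": 0.0} for pair, prob in smt_table.items()
    }
    for pair in rbmt_pairs:
        if pair in merged:
            merged[pair]["from_rbmt"] = 1.0  # both sources propose this pair
        else:
            merged[pair] = {"prob": rbmt_score, "from_rbmt": 1.0}
    return merged


# Invented entries: translation probabilities estimated from a parallel corpus.
smt_table = {("casa", "house"): 0.8, ("casa", "home"): 0.2}
# Invented pairs obtained by translating source phrases with an RBMT engine.
rbmt_pairs = [("casa", "house"), ("casa blanca", "white house")]

table = merge_phrase_tables(smt_table, rbmt_pairs)
```

Keeping the origin as a separate feature, rather than overwriting probabilities, leaves it to the tuning step of the statistical system to decide how much to trust rule-generated entries.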

Hybridization within corpus-based approaches. When combining corpus-based approaches, Groves and Way (2005) mix sub-sentential alignments from phrase-based SMT and EBMT systems, proposing to build a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT sub-sentential alignments. There is an extensive body of work on incorporating translation memories (TM) into phrase-based SMT systems. TM are simply large databases of translated words and sequences of words, generally created by human translators. One of the most recent studies (Wang et al., 2013) proposes integrated models to make maximum use of TM information during decoding. The aim is to keep, for each TM source phrase, all its possible corresponding target phrases. The integrated models then consider all corresponding TM target phrases and SMT preferences during decoding. Therefore, the proposed integrated models combine SMT and TM at a deep level. A traditional approach that should not be overlooked is the use of templates (Och and Ney, 2004), which themselves can be considered to be stochastically-extracted transduction type rules. There are also approaches that combine n-gram and phrase SMT in series (Costa-jussà and Fonollosa, 2010). The former pre-reorders the source sentences and offers a reordering graph that the latter translates using monotonic decoding.
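
For readers unfamiliar with translation memories, the sketch below shows the simplest possible combination of a TM with an SMT system: a fuzzy lookup with a fallback. It is a pipeline-level illustration only and is much shallower than the integrated decoding models of Wang et al. (2013); the threshold value, the memory contents and the `smt_translate` callable are hypothetical.

```python
from difflib import SequenceMatcher


def tm_lookup(source, memory, threshold=0.8):
    """Return (target, score) for the most similar TM entry, or None when no
    entry reaches the fuzzy-match threshold. Similarity is computed on tokens."""
    best_target, best_score = None, 0.0
    for tm_source, tm_target in memory.items():
        score = SequenceMatcher(None, source.split(), tm_source.split()).ratio()
        if score > best_score:
            best_target, best_score = tm_target, score
    if best_score >= threshold:
        return best_target, best_score
    return None


def translate(source, memory, smt_translate):
    """Prefer a close TM match; otherwise fall back to the SMT system."""
    hit = tm_lookup(source, memory)
    return hit[0] if hit else smt_translate(source)


# Invented memory with one human-translated segment.
memory = {"press the red button": "pulse el botón rojo"}


def fake_smt(sentence):
    """Placeholder standing in for a real SMT decoder."""
    return "<smt:" + sentence + ">"
```

A call such as `translate("open the door", memory, fake_smt)` falls through to the placeholder decoder, whereas an exact or near-exact match is answered from the memory.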

Table 1
Hybrid MT architectures, added information and the corresponding most representative references.

Guided by | Information | References
RBMT | Corpus to build | Habash et al. (2009), Sánchez-Martínez et al. (2009), Antonova and Misyurev (2014), Göhring (2014), Sánchez-Martínez and Forcada (2009), Tyers et al. (2012), Rudnick and Gasser (2013), Costa-jussà and Centelles (2015)
RBMT | Corpus to weight outputs | Federmann and Hunsicker (2011), Dove et al. (2012), Hunsicker et al. (2012), Labaka et al. (2014), Crego (2014), Eberle (2014)
RBMT | Statistical post-editing | Simard et al. (2007), Lagarda et al. (2009), Suzuki (2011), Béchara et al. (2012)
Corpus-based MT | Rules at pre/post-processing | Xia and McCord (2004), Collins et al. (2005), Wang et al. (2007), Patel et al. (2013), Farrús et al. (2011), Formiga et al. (2012)
Corpus-based MT | Dictionaries/rules into the core model | Carl et al. (2000), Hua and Haifeng (2004), Okuma et al. (2008), Eisele et al. (2008), Sánchez-Cartagena et al. (2011), Chen and Eisele (2010), Ahsan et al. (2010), Enache et al. (2012)
Corpus-based MT | Only corpus | Och and Ney (2004), Groves and Way (2005), Wang et al. (2013), Carbonell et al. (2006), Carl et al. (2008), Costa-jussà and Fonollosa (2010), Tambouratzis et al. (2013)

Finally, there are approaches that are exempt from the requirement for parallel corpora or resources in general. There is an MT method that needs no parallel text and relies on a translation model built from a bilingual dictionary, and a decoder for long-range context (Carbonell et al., 2006). In the same direction, other systems use low resources (Carl et al., 2008) and a methodology designed to facilitate rapid creation of the MT system for unconstrained language pairs (Tambouratzis et al., 2013) (Table 1).

4. Machine translation applications with hybrid components

Among the variety of MT applications, we can name popular ones such as speech translation, cross-lingual information retrieval and computer-aided translation. Hybridization within these applications has been used in different ways and we comment on some of them without aiming to be exhaustive. See Table 2 for a short summary of references.

Speech translation. Frequently, speech translation is addressed as a concatenation of a speech recognizer, a machine translator and a speech synthesizer. Hybridization in this application can be placed in any of the three systems. In speech recognition, hybridization has been done by incorporating neural network approaches into state-of-the-art continuous speech recognition systems based on hidden Markov models (HMMs) (Bourlard and Morgan, 1993). There is also the combination of HMMs and learning vector quantization (LVQ) (Katagiri and Lee, 1993), or the use of Support Vector Machines (SVMs) for classification by integrating this method into a HMM-based speech recognition system (Ganapathiraju, 2002). In speech synthesis, the hybridization has been done by combining concatenative synthesis and statistical synthesis (Tiomkin et al., 2011).

Cross-lingual information retrieval. Normally, the application of cross-lingual information retrieval is done by concatenating MT and information retrieval. For example, Mittal et al. (2010) present a hybrid information system combining: (1) an ontology for the retrieval of the user's context; (2) a user profile that is temporarily updated according to the user's browsing behavior; and (3) collaborative filtering for considering recommendations of similar users. Elsewhere, Rose and Belew (1989) use a combination of symbolic and connectionist artificial intelligence techniques.
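
The concatenation of MT and information retrieval mentioned above can be sketched as a query-translation step followed by a ranking step. The dictionary, the documents and the term-overlap scoring below are invented for illustration and are far simpler than the hybrid systems cited.

```python
def translate_query(query, dictionary):
    """MT step: word-for-word query translation with a bilingual dictionary."""
    return [dictionary.get(w, w) for w in query.lower().split()]


def retrieve(query_terms, documents):
    """IR step: rank documents by the number of query terms they contain."""
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = sum(1 for term in query_terms if term in words)
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document identifier.
    return [doc_id for _, doc_id in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Invented Spanish-English dictionary and English document collection.
dictionary = {"traducción": "translation", "automática": "machine"}
documents = {
    "d1": "a survey of hybrid machine translation",
    "d2": "speech recognition with hidden markov models",
    "d3": "translation memories for human translators",
}
ranking = retrieve(translate_query("traducción automática", dictionary), documents)
```

Either stage can be replaced independently, which is where hybrid components such as user profiles or ontologies could be plugged in.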

Table 2
Hybrid MT applications.

MT applications | References
Speech translation | Bourlard and Morgan (1993), Katagiri and Lee (1993), Ganapathiraju (2002), Tiomkin et al. (2011)
Cross-lingual information retrieval | Mittal et al. (2010), Rose and Belew (1989)
Computer-aided translation | Wong et al. (2012), Yamabana et al. (1997), Federico et al. (2014)

Computer-aided translation. Finally, computer-aided translation is by definition a combination of the roles of both man and machine. Recent work, like Wong et al. (2012), uses a machine-aided translation system, which is a hybrid system that applies not only TM technology but also MT methodologies, including the annotation schema of Translation Corresponding Tree (TCT) in the representation of bilingual examples, and the language formalism of Constraint-based Synchronous Grammar (CSG) in analyzing the syntactic structure between the languages. Yamabana et al. (1997) also propose a hybrid interactive MT method that combines rule and example-based MT approaches with an interactive man-machine interface. Advanced work in the field, such as Federico et al. (2014), includes approaches to incremental training or active learning which are representative of live human-machine hybridization where the MT system learns and improves based on human interaction.

5. Conclusions

This survey reported an overview of several relevant works on hybrid MT which combine different MT architectures to provide better translation quality. Combinations aim at extracting the best features of each paradigm and solving the problems of pure architectures. That is why hybrid MT has helped to advance the field and is a promising line of research.

This paper provides a structured classification that can cover the majority of research on hybrid MT. The classification is based on the fact that combinations of MT approaches are normally guided by a core system which can be either rule or corpus-based. Most of the research combines sources of information (rules and data), but there are also projects combining various corpus-based approaches. It is difficult to assess which is the most relevant or promising hybrid type of architecture, but it would seem reasonable to use the best-performing system as a guide, and the others for additional information.

The good results produced by hybridization have led to a corresponding spread of MT applications such as speech translation, cross-language information retrieval, computer-aided and post-edited MT systems. Work with hybrid strategies both in MT and its applications brings significant improvement because they allow the simultaneous exploitation of a variety of systems.

Acknowledgements

The authors would like to express particular thanks to Declan Groves for his valuable comments while reviewing this paper. This paper has been partially supported by the Spanish Ministerio de Economía y Competitividad, contract TEC2012-38939-C03-02, as well as by the European Regional Development Fund (ERDF/FEDER), the Seventh Framework Program of the European Commission through the International Outgoing Fellowship, Marie Curie Action (IMTraP-2011-29951), and the AGAUR under the MOOCs 2013 contract.

References

Ahsan, A., Kolachina, P., Kolachina, S., Sharma, D.M., Sangal, R., 2010. Coupling statistical machine translation with rule-based transfer and generation. In: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas.
Antonova, A., Misyurev, A., 2014. Improving the precision of automatically constructed human-oriented translation dictionaries. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), pp. 58–66.
Béchara, H., Rubino, R., He, Y., Ma, Y., Genabith, J., 2012. An evaluation of statistical post-editing systems applied to RBMT and SMT systems. In: Proceedings of International Conference on Computational Linguistics (COLING), pp. 215–230.
Bourlard, H.A., Morgan, N., 1993. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, MA, USA.
Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L., 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19 (June (2)), 263–311.
Carbonell, J.G., Klein, S., Miller, D., Steinbaum, M., Grassiany, T., Frey, J., 2006. Context-based machine translation. In: Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).
Carl, M., Melero, M., Badia, T., Vandeghinste, V., Dirix, P., Schuurman, I., Markantonatou, S., Sofianopoulos, S., Vassiliou, M., Yannoutsou, O., 2008. METIS-II: low resource machine translation. Mach. Transl. 22 (1–2), 67–99.
Carl, M., Pease, C., Iomdin, L., Streiter, O., 2000. Towards a dynamic linkage of example-based and rule-based machine translation. Mach. Transl. 15 (3), 223–257.
Chen, Y., Eisele, A., 2010. Integrating a rule-based with a hierarchical translation system. In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC).
Collins, M., Koehn, P., Kučerová, I., 2005. Clause restructuring for statistical machine translation. In: Proceedings of the ACL, Ann Arbor, pp. 531–540.
Costa-jussà, M.R., Centelles, J., 2015. Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques. ACM Trans. Asian Lang. Inf. Process. (submitted for publication).
Costa-jussà, M.R., Farrús, M., 2014. Statistical machine translation enhancements through linguistic levels: a survey. ACM Comput. Surv. 46 (3), 42.
Costa-jussà, M.R., Fonollosa, J.A.R., 2010. Using linear interpolation and weighted reordering hypotheses in the Moses system. In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC), European Languages Resources Association (ELRA).
Crego, J.M., 2014. SYSTRAN RBMT engine: hybridization experiments. In: Talk at 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), April, Gothenburg, Sweden.
Dove, C., Loskutova, O., de la Fuente, R., 2012. What’s your pick: RbMT, SMT or hybrid? In: Proceedings of 11th Conference of the Association for Machine Translation in the Americas (AMTA), San Diego.
Eberle, K., 2014. Hybrid Strategies for better products and shorter time-to-market. In: Talk at 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), April, Gothenburg, Sweden.
Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., Chen, Y., 2008. Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. In: Proceedings of the 3rd Workshop on Statistical Machine Translation (WMT), pp. 179–182.
Enache, R., España-Bonet, C., Ranta, A., Màrquez, L., 2012. A hybrid system for patent translation. In: Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT), May, Trento, Italy, pp. 269–276.
Farrús, M., Costa-jussà, M.R., Mariño, J., Poch, M., Hernández, A., Henríquez, C., Fonollosa, J., 2011. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair. Lang. Resour. Eval. 45 (2), 181–208.
Federico, M., Bertoldi, N., Cettolo, M., Negri, M., Turchi, M., Trombetti, M., Cattelan, A., Farina, A., Lupinetti, D., Martines, A., Massidda, A., Schwenk, H., Barrault, L., Blain, F., Koehn, P., Buck, C., Germann, U., 2014. The MateCat tool. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, August, Dublin, Ireland, pp. 129–132.
Federmann, C., Hunsicker, S., 2011. Stochastic parse tree selection for an existing RBMT system. In: Proceedings of the 6th Workshop on Statistical Machine Translation (WMT), pp. 351–357.
Formiga, L., Hernández, A., Mariño, J.B., Monte, E., 2012. Improving English to Spanish out-of-domain translations by morphology generalization and generation. In: AMTA Workshop on Monolingual Machine Translation.
Ganapathiraju, A., 2002. Support Vector Machines for Speech Recognition (PhD thesis). Mississippi State, MS, USA.
Göhring, A., 2014. Building a Spanish-German dictionary for hybrid MT. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), pp. 30–35.
Groves, D., Way, A., 2005. Hybrid data-driven models of machine translation. Mach. Transl. 19 (December (3–4)), 301–323.
Habash, N., Dorr, B., Monz, C., 2009. Symbolic-to-statistical hybridization: extending generation-heavy machine translation. Mach. Transl. 23 (1), 23–63.
Hua, W., Haifeng, W., 2004. Improving statistical word alignment with a rule-based machine translation system. In: Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, p. 29.
Hunsicker, S., Yu, C., Federmann, C., 2012. Machine learning for hybrid machine translation. In: Proceedings of the 7th Workshop on Statistical Machine Translation (WMT), June, pp. 312–316.
Katagiri, S., Lee, C.-H., 1993. A new hybrid algorithm for speech recognition based on HMM segmentation and learning vector quantization. IEEE Trans. Speech Audio Process. 1 (October (4)), 421–430.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E., 2007. Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180.
Labaka, G., España-Bonet, C., Màrquez, L., Sarasola, K., 2014. A hybrid machine translation architecture guided by syntax. Mach. Transl. 28, 1–35.
Lagarda, A.-L., Alabau, V., Casacuberta, F., Silva, R., Diaz-de Liano, E., 2009. Statistical post-editing of a rule-based machine translation system. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pp. 217–220.
Mittal, N., Nayak, R., Govil, M.C., Jain, K.C., 2010. Evaluation of a hybrid approach of personalized web information retrieval using the FIRE data set. In: Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India, New York, NY, USA, pp. 52:1–52:6.
Och, F.J., Ney, H., 2004. The alignment template approach to statistical machine translation. Comput. Linguist. 30 (December (4)), 417–449.
Okuma, H., Yamamoto, H., Sumita, E., 2008. Introducing a translation dictionary into phrase-based SMT. IEICE Trans. Inf. Syst. E91-D (July (7)), 2051–2057.
Patel, R.N., Gupta, R., Pimpale, P.B., Sasikumar, M., 2013. Reordering rules for English-Hindi SMT. In: Proceedings of the 2nd Workshop on Hybrid Approaches to Translation (HyTra), August, Sofia, Bulgaria, pp. 34–41.
Rose, D.E., Belew, R.K., 1989. Legal information retrieval: a hybrid approach. In: Proceedings of the 2nd International Conference on Artificial Intelligence and Law, ICAIL’89, ACM, New York, NY, USA, pp. 138–146.
Rudnick, A., Gasser, M., 2013. Lexical selection for hybrid MT with sequence labeling. In: Proceedings of the 2nd Workshop on Hybrid Approaches to Translation (HyTra), August, Sofia, Bulgaria, pp. 102–108.
Sánchez-Cartagena, V.M., Sánchez-Martínez, F., Pérez-Ortiz, J.A., et al., 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In: Machine Translation Summit.
Sánchez-Martínez,
F.,
Forcada,
M.L.,
2009.
Inferring
shallow-transfer
machine
translation
rules
from
small
parallel
corpora.
J.
Artif.
Intell.
Res.
34,
605–635.
Sánchez-Martí
nez,
F.,
Forcada,
M.L.,
Way,
A.,
2009.
Hybrid
rule-based-example-based
MT:
feeding
Apertium
with
sub-sentential
translation
units.
In:
Proceedings
of
the
3rd
Workshop
on
Example-Based
Machine
Translation,
Dublin,
pp.
11–18.
Simard,
M.,
Ueffing,
N.,
Isabelle,
P.,
Kuhn,
R.,
2007.
Rule-based
translation
with
statistical
phrase-based
post-editing.
In:
Proceedings
of
the
2nd
Workshop
on
Statistical
Machine
Translation
(WMT),
pp.
203–206.
Suzuki, H., 2011. Automatic post-editing based on SMT and its selective application by sentence-level automatic quality evaluation. In: Proceedings of the 13th Machine Translation Summit (MT Summit XIII), International Association for Machine Translation, pp. 156–163.
Tambouratzis, G., Sofianopoulos, S., Vassiliou, M., 2013. Language-independent hybrid MT with PRESEMT. In: Proceedings of the 2nd Workshop on Hybrid Approaches to Translation (HyTra), August, Sofia, Bulgaria, pp. 123–130.
Tiomkin, S., Malah, D., Shechtman, S., Kons, Z., 2011. A hybrid text-to-speech system that combines concatenative and statistical synthesis units. IEEE Trans. Audio Speech Lang. Process. 19 (July (5)), 1278–1288.
Tyers, F.M., Sánchez-Martínez, F., Forcada, M.L., 2012. Flexible finite-state lexical selection for rule-based machine translation. In: Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), May, Trento, Italy, pp. 213–220.
Wang, C., Collins, M., Koehn, P., 2007. Chinese syntactic reordering for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 737–745.
Wang, K., Zong, C., Su, K.-Y., 2013. Integrating translation memory into phrase-based machine translation during decoding. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August, Sofia, Bulgaria, pp. 11–21.
Wong, F., Oliveira, F., Li, Y., 2012. Hybrid machine aided translation system based on constraint synchronous grammar and translation corresponding tree. J. Comput. 7 (February (2)), 309–316.
Wu, D., 2005. MT model space: statistical versus compositional versus example-based machine translation. Mach. Transl. 19 (December (3–4)), 213–227.
Xia, F., McCord, M., 2004. Improving a statistical MT system with automatically learned rewrite patterns. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING).
Yamabana, K., Kamei, S.-i., Muraki, K., Doi, S., Tamura, S., Satoh, K., 1997. A hybrid approach to interactive machine translation: integrating rule-based, corpus-based, and example-based method. In: Proceedings of the 15th International Joint Conference on Artificial Intelligence, IJCAI’97, vol. 2, San Francisco, CA, USA, pp. 977–982.
... Researchers have used a variety of approaches in developing machine translation, which could be grouped into three categories: rules-based machine translation (RBMT), statistical machine translation (SMT), and neural machine translation (NMT) (Costa-Jussa and Fonollosa, 2015). Rule-based systems use expert-developed linguistic knowledge, such as grammar and language rules. ...
... Over the last decades, researchers have successfully combined rule-based machine translation (RBMT) and SMT techniques to develop more powerful hybrid machine translations (Alshawi et al., 1998; Yamada and Knight, 2001). And the appearance of websites for automatic translation like Google in 2006 (Costa-Jussa and Fonollosa, 2015). ...
Thesis
In this study, we developed an NMT model from English to Igbo, an African language spoken by over 40 million people in Nigeria and across West Africa. We used the standard benchmark dataset collected from Bible corpora, local news, Wikipedia articles, and Common Crawl, verified by language experts. RNN-based architectures, including LSTM and GRU, were employed in conjunction with the attention mechanism in our proposed solution. The translation quality exhibited by our model was found to be comparable to the current state-of-the-art benchmark for English-Igbo translation. By leveraging transfer learning techniques, our model outperformed the English-Igbo state-of-the-art benchmark by up to +4.83 BLEU points, achieving a translation quality of 70%. This achievement is particularly significant in the context of low-resource translations.
... The focus will be on utilizing each party's distinct strengths: the accuracy and effectiveness of AI algorithms combined with the subtle understanding and originality of human translators. For instance, according to Costa-Jussa and Fonollosa (2015), hybridization guided by RBMT incorporates corpus-based data into an RBMT system, HMT guided by SMT incorporates linguistic rules, and hybrid SMT combines various SMT methodologies, such as phrase-based SMT and n-gram SMT. Various applications, including commercial MT systems, governmental translation systems, and research initiatives, use hybrid MT. ...
Article
This review paper provides an overview of the use of artificial intelligence (AI) in Translation Studies (TS), covering statistical machine translation, rule-based machine translation, neural machine translation, and hybrid machine translation. It explores the advantages and limitations of each model, as well as their applications in translation. Additionally, it discusses various techniques for evaluating the effectiveness of AI models in translation, along with their advantages and limitations, such as handling figurative language (e.g., idioms, metaphors) and cultural nuances. The review also delves into research directions for improving AI-based translation, elaborates on the ethical and social implications of AI in translation, and discusses the representation of AI in other disciplines such as literature and arts. Finally, the impact of AI as well as the opportunities and challenges that it could create for translators, such as professional challenges, data privacy, bias, and fairness matters were briefly discussed. By summarizing the main findings, and lessons learnt in AI-based translation, some recommendations regarding the current and future direction of using AI in translation were formulated.
... The recent rise in the number of academics as well as people outside academia who have become part of the translation and interpretation community has led to an increase in studies in this area in aspects such as cognitive processes (Angelone & Marín, 2022; Christoffels et al., 2006; Mellinger, 2022), strategy use (Abdelaal, 2019), and machine learning (Castilho et al., 2018; Costa-jussà & Fonollosa, 2015). Surprisingly, an understudied area in this regard has been the role of language proficiency in translation and interpretation. ...
Article
Visionary teaching interventions have had a positive impact on developing and strengthening students’ ideal L2 self and motivated behavior. However, research on the effects of this kind of intervention on the motivation of Translation and Interpretation students is scarce. Using a mixed methods approach, we have evaluated the impact of a semester-long intervention focused on Translation and Interpretation students’ future professional careers on their motivation, intended effort, and willingness to communicate. A questionnaire was used to estimate ideal L2 self, ought-to L2 self, learning attitudes, intended effort, ease of using imagery, and willingness to communicate in writing. Additionally, we used a semi-structured interview to explore in further depth the students’ perceptions of the experience. The results of this study reveal that visionary teaching increased both the ideal L2 self and the intended effort of students. Additionally, the analysis of the semi-structured interview data showed that the intervention was memorable for students and that it helped them to establish a future L2 professional vision and to outline the steps to achieve it. Our findings strengthen the importance of including visionary teaching in translation and interpretation programs, so that students can become motivated and involved in their future professional paths.
... Deep learning-based artificial intelligence has gained tremendous attention in recent years. Deep learning models are being used in a wide range of applications [1][2][3], such as computer vision [4,5], machine translation [6], natural language processing [7], and recommender systems [8]. In addition, deep learning techniques have achieved great success in real-time applications such as self-driving cars [9,10], unmanned aerial vehicles (UAVs) [11], and autonomous robots [12][13][14]. ...
Article
Deep learning is employed in many applications, such as computer vision, natural language processing, robotics, and recommender systems. Large and complex neural networks lead to high accuracy; however, they adversely affect many aspects of deep learning performance, such as training time, latency, throughput, energy consumption, and memory usage in the training and inference stages. To solve these challenges, various optimization techniques and frameworks have been developed for the efficient performance of deep learning models in the training and inference stages. Although optimization techniques such as quantization have been studied thoroughly in the past, less work has been done to study the performance of frameworks that provide quantization techniques. In this paper, we have used different performance metrics to study the performance of various quantization frameworks, including TensorFlow automatic mixed precision and TensorRT. These performance metrics include training time and memory utilization in the training stage along with latency and throughput for graphics processing units (GPUs) in the inference stage. We have applied the automatic mixed precision (AMP) technique during the training stage using the TensorFlow framework, while for inference we have utilized the TensorRT framework for the post-training quantization technique using the TensorFlow TensorRT (TF-TRT) application programming interface (API). We performed model profiling for different deep learning models, datasets, image sizes, and batch sizes for both the training and inference stages, the results of which can help developers and researchers to devise and deploy efficient deep learning models for GPUs.
... Focusing on MT modules, Alqudsi et al. [33] reviewed MT techniques used to translate the Arabic Language into English. Costa-Jussa et al. [34] focused on the rule-based structure (hybridization) and its applications, whereas Gaspari et al. [35] considered MT quality assessment and post-editing aspects. Khan et al. [36] focused on the performance of phrase-based Statistical Machine translation (SMT) in multiple Indian languages. ...
... It is a multidisciplinary field with approaches from both linguistics and statistics [14]. There are some approaches to machine translation. ...
Article
The machine translation of numbers from the English language into the Yorùbá language is an integral part of a machine translation system from the English language into the Yorùbá language as a numeral system is an important aspect of any language. This paper presents a computational approach to English number text translation into Yorùbá text. The approach shows the number of ways an English number text can be translated into the Yorùbá language and the various forms in which the translations can be done based on context. This was carried out by collecting numeral data from the Yorùbá literature, formulating context-free grammar, and implementing the model with the Python programming language. The evaluation of the system was carried out using the Bilingual Evaluation Understudy (BLEU) score. The result of the approach is a software artefact for translating number text in the context of simple English sentences to Yorùbá text. They can further be integrated as a module into a robust machine translation system for effective and accurate translation from the English language to the Yorùbá language.
Article
We studied two fundamental linguistic channels—the sentences and the interpunctions channels—and showed they can reveal deeper connections between texts. The applied theory does not follow the actual paradigm of linguistic studies. As a study case, we considered the Greek New Testament, with the purpose of determining mathematical connections between its texts and possible differences in the writing style (mathematically defined) of the writers and in the reading skill required of their readers. The analysis was based on deep-language parameters and communication/information theory. To set the New Testament texts in the larger Greek classical literature, we considered texts written by Aesop, Polybius, Flavius Josephus, and Plutarch. The results largely confirmed what scholars have found about the New Testament texts, therefore giving credibility to the theory. The Gospel according to John is very similar to the fables written by Aesop. Surprisingly, the Epistle to the Hebrews and Apocalypse are each other’s “photocopies” in the two linguistic channels and not linked to all other texts. These two texts deserve further study by historians of the early Christian church literature at the level of meaning, readers, and possible Old Testament texts that might have influenced them. The theory can guide scholars to study any literary corpus.
Preprint
We study two fundamental linguistic channels, the Sentences and the Interpunctions channels, and show that they can reveal deeper connections between texts. The theory applied does not follow the current paradigm of linguistic studies. As a case study, we consider the Greek New Testament, with the purpose of determining mathematical connections between its texts and possible differences in the writing style (mathematically defined) of the writers and in the reading skill required of their readers. The analysis is based on deep-language parameters and communication/information theory. To set the New Testament texts in the larger Greek Classical Literature, we consider texts written by Aesop, Polybius, Flavius Josephus and Plutarch. The results largely confirm what scholars have found about the New Testament texts, therefore giving credibility to the theory. The Gospel according to John is very similar to the Fables written by Aesop. Surprisingly, the Epistle to the Hebrews and Apocalypse are each other's “photocopies” in the two linguistic channels, and are not linked to all other texts. These two texts deserve further study by historians of the early Christian Church Literature at the level of meaning, readers and possible Old Testament texts which might have influenced them. The theory can guide scholars to study any literary corpus.
Article
Machine translation (MT) has been one of the most popular fields in computational linguistics and artificial intelligence (AI). As one of the most promising approaches, MT can potentially break the language barrier between people from all over the world. Despite a number of studies in MT, few studies summarize and compare MT methods. To this end, in this paper, we principally focus on presenting the two mainstream MT schemes: statistical machine translation (SMT) and neural machine translation (NMT), including their basic rationales and developments. The detailed translation models are also presented, such as the word-based, syntax-based, and phrase-based models in statistical machine translation. Similarly, approaches in NMT, such as the recurrent neural network-based, attention mechanism-based, and transformer-based models, are presented. Last but not least, evaluation approaches also play an important role in helping developers to improve their MT methods. The prevailing machine translation evaluation methodologies are also presented in this article.
Article
The article identifies the possible classifications of machine translation, highlights the clearest typology, and identifies the advantages and disadvantages of each type of machine translation. Intercultural communication is difficult to imagine without the use of translation, but acquiring the competence of a translator requires a lot of time and effort. Therefore, it is difficult to overestimate the relevance of studying and solving problems related to machine translation, and the importance of its practical application in overcoming the language barrier. The term “Machine Translation” (abbreviated MT, in Ukrainian “Машинний переклад” or “МП”) refers to the translation of one natural language into another using special software. During the research it was found that there are several classifications of machine translation (by the number of languages, by the direction of translation, and by the role that a person plays in the MT process). However, we have considered the most commonly used typology: the division into two main groups, rule-based and statistical machine translation. Hybrid systems, which are designed to combine the most effective features of rule-based and statistical systems, are singled out. The study describes these four types of machine translation, their features, causes and uses. In addition, it is specified which programs belong to a particular type of machine translation. The article also points out the advantages and disadvantages of each of the four types of machine translation. Based on the study, it was concluded that hybrid approaches, designed to combine the advantages of classical and statistical approaches, are currently becoming increasingly popular. At the moment, MT systems are not suitable for working with texts that contain a large number of complex sentences, and they work well mainly at the phrase level.
Key words: machine translation, hybrid system, statistical system, translation memory, rule-based system.
Conference Paper
Statistical post-editing (SPE) of the output produced by rule-based MT (RBMT) systems has been reported to produce extraordinary BLEU (and other automatic evaluation) score improvements. SPE has also been applied to the output of statistical MT (SMT) systems, albeit with more mixed results. We present a statistical post-editing pipeline and evaluate the outputs using automatic and human evaluation techniques, comparing the two SPE pipeline systems (RBMT + SPE and SMT + SPE) with the pure RBMT and SMT system, in an SPE scenario that uses independently existing bitext data, rather than manually corrected first stage MT output, as its training data. Our results show that although automatic evaluation metrics favour the pure SMT system, human evaluators prefer the output provided by the statistically post-edited RBMT system.
Article
This article presents a hybrid architecture which combines rule-based machine translation (RBMT) with phrase-based statistical machine translation (SMT). The hybrid translation system is guided by the rule-based engine. Before the transfer step, a varied set of partial candidate translations is calculated with the SMT system and used to enrich the tree-based representation with more translation alternatives. The final translation is constructed by choosing the most probable combination among the available fragments using monotone statistical decoding following the order provided by the rule-based system. We apply the hybrid model to a pair of distantly related languages, Spanish and Basque, and perform extensive experimentation on two different corpora. According to our empirical evaluation, the hybrid approach outperforms the best individual system across a varied set of automatic translation evaluation metrics. Following some output analysis to better understand the behaviour of the hybrid system, we explore the possibility of adding alternative parse trees and extra features to the hybrid decoder. Finally, we present a twofold manual evaluation of the translation systems studied in this paper, consisting of (i) a pairwise output comparison and (ii) an individual task-oriented evaluation using HTER. Interestingly, the manual evaluation shows some contradictory results with respect to the automatic evaluation; humans tend to prefer the translations from the RBMT system over the statistical and hybrid translations.
Article
Two of the most popular Machine Translation (MT) paradigms are rule based (RBMT) and corpus based, which includes the statistical systems (SMT). When only a scarce parallel corpus is available, RBMT becomes particularly attractive. This is the case of the Chinese-Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system taking advantage of available resources such as parallel corpora that are used to extract dictionaries and lexical and structural transfer rules. The final system is freely available online and open source. Although performance lags behind standard SMT systems for an in-domain test set, the results show that the RBMT’s coverage is competitive and it outperforms the SMT system in an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems.
Conference Paper
In this paper we address the problem of automatic acquisition of a human-oriented translation dictionary from a large-scale parallel corpus. The initial translation equivalents can be extracted with the help of the techniques and tools developed for the phrase-table construction in statistical machine translation. The acquired translation equivalents usually provide good lexicon coverage, but they also contain a large amount of noise. We propose a supervised learning algorithm for the detection of noisy translations, which takes into account the context and syntax features, averaged over the sentences in which a given phrase pair occurred. Across nine European language pairs the number of serious translation errors is reduced by 43.2%, compared to a baseline which uses only phrase-level statistics.
Article
This paper describes the development of the Spanish-German dictionary used in our hybrid MT system. The compilation process relies entirely on open source tools and freely available language resources. Our bilingual dictionary of around 33,700 entries may thus be used, distributed and further enhanced as convenient.
Conference Paper
Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM result into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese-English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. Besides, the proposed models also outperform previous approaches significantly.