Quality-Based Ranking of Translation Outputs
Nivedita Bharti, Nisheeth Joshi, Iti Mathur,
and Pragya Katyayan
Banasthali Vidyapith
Abstract—Translation ranking is of great significance for machine translation (MT): it allows the performance of multiple MT systems to be compared and supports their efficient training. This article demonstrates a mechanism for ranking the translation outputs generated by MT systems from best to worst. To implement this approach, the system trains a supervised learning algorithm on existing manual rankings, using various features obtained from the linguistic analysis of both source- and target-side sentences, without relying on reference translations.
In recent years, MT systems have achieved significant improvements. However, the quality of the output generated by translation systems is neither consistent nor perfect across multiple unseen test sentences. For this reason, researchers have been focusing on developing techniques for estimating the quality of translated text and obtaining indications of translation performance in a real-time translation environment without any human intervention (i.e., without access to the reference (correct) translations). This article focuses on solving the challenging task of automatically ranking, from best to worst, the alternative translation outputs obtained from several MT systems for a given source sentence. This ranking task, commonly performed manually by human judges (annotators), is an acknowledged practice for evaluating translation outputs.
Additionally, when several MT systems are used in combination, some systems may translate a given input source sentence perfectly (correctly) while others may not translate it well. In such a case, selecting the best translation can boost performance. Therefore, to deal with these kinds of issues, we develop a translation ranking system that exploits machine learning (ML) techniques to imitate human behavior. In detail, this automatic translation ranking system can rank the various translation outputs generated for a given source sentence according to their comparative quality. This framework is
Digital Object Identifier 10.1109/MITP.2020.2976009
Date of current version 17 July 2020.
Theme Article: Artificial Intelligence
July/August 2020 Published by the IEEE Computer Society 1520-9202 ß2020 IEEE 21
Authorized licensed use limited to: University of Canberra. Downloaded on July 19,2020 at 14:13:33 UTC from IEEE Xplore. Restrictions apply.
developed using a regression algorithm trained on previously manually annotated ranks together with numerous qualitative criteria computed on both the source and target sentences.
RANKING PROBLEM DESCRIPTION
This article focuses on developing a ranking system that ranks alternative translation outputs as a human would. In particular, the ranking system is given multiple translation outputs corresponding to a given source sentence; these alternative translations have been generated by multiple MT systems. In other words, the main goal of this task is to rank all the generated translations according to their translation quality by considering numerous qualitative measures over the translation outputs. Mathematically, we describe a ranking system in (1) using a three-tuple as
R_system = {S, T, R}   (1)

where S represents a given source sentence with the corresponding set of translations represented by the tuple T = {t_1, t_2, t_3, ..., t_n}, where t_k is the kth translation corresponding to the source sentence S, and n is the total count of generated translations.
Here, every individual translation in the translation set T is associated with an ordinal list of ranks (judgments) represented by the tuple R = {r_1, r_2, r_3, ..., r_m}, where r_k corresponds to the rank given to translation t_k compared to the other alternative translations in the set T. From this, it is clear that this qualitative ranking mechanism does not indicate any generic or absolute quality measure. Since translation ranking is done at the sentence level, the mechanism focuses on one sentence at a time, considering its alternative translations and finally deciding on their translation quality. Thus, an annotated rank carries meaning only for the sentence under consideration and its corresponding generated translations. In particular, every source sentence S_j is associated with a translation set T_j = (t_1^j, t_2^j, t_3^j, ..., t_n^j), where t_i^j is the ith translation of the jth sentence and n is the translation count, while each translation list is associated with a list containing the relative rankings R_j = (r_1^j, r_2^j, r_3^j, ..., r_m^j), where r_k^j corresponds to the rank of the kth translation of the jth source sentence. Finally, one important point to note is that if the quality of two translation outputs is similar, the case is called a tie, and the same rank is assigned to both translation candidates.
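The tie convention above can be sketched in code. This is a minimal illustration; the function name and the dense-ranking convention are our assumptions, not details from the article:

```python
def ranks_with_ties(scores):
    """Map per-translation quality scores (lower = better) to 1-based
    ranks; translations with equal scores share the same rank, as in
    the tie case described above."""
    distinct = sorted(set(scores))
    return [distinct.index(s) + 1 for s in scores]


# Example: the two middle translations tie and receive the same rank.
print(ranks_with_ties([0.2, 0.5, 0.5, 0.9]))  # -> [1, 2, 2, 3]
```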
NEED OF RANKING SYSTEM
There is a need for developing a ranking system because the translation industry requires more transparency regarding MT systems' strengths and weaknesses. Indeed, the professional community nowadays discusses the services provided by MT systems and their effects on commercial software. Some of these important needs are given as follows.
Ranking contributes to the definition of translation quality: Enhancing translation quality is essential for providers to avoid potential lawsuits, since a flaw in a translation can ultimately lead to a safety violation.1
Ranking gives a set of criteria and a rationale for evaluating translation systems: According to Church and Hovy, "it should be interpretable what an MT system can and cannot do,"2 particularly when translation services are used by the general public and applied at large scale.
Ranking responds to consumers' demands for easily accountable information about translation system quality: Ranking the accuracy (usefulness) of translation systems can help end-users decide which translation system suits their purpose.
RELATED WORK
The concept of translation ranking has been employed in various MT-related tasks. Initially, in the WMT07 evaluation task, the ranking concept was used for evaluating machine translation (MT).3 In this direction, some previous contributions based on ML approaches were proposed for training on rank data.4,5 However, these approaches used reference translations and were evaluated only by obtaining an overall corpus-level ranking.
In contrast, and closer to our reference-free approach, Rosti et al.6 performed translation selection using generalized models at the sentence level. In particular, they exploited the re-ranking of N-best lists combined from multiple translation systems to comparatively rank the translation outputs. Specia et al.7 developed one translation quality prediction model for every translation system; the scores predicted by these individual models were then used to rank the alternative candidate translations of the same source-language sentence.
Later, in this context, Soricut and Narsale8 employed ML to rank the multiple alternative translations and selected the highest-ranked translation output. Subsequently, Avramidis and Popovic9 ranked alternative translation outputs using logistic regression as pairwise classifiers with black-box features. Tezcan et al.10 trained a regression model using baseline features combined with word-level predictions as features to develop a sentence-level ranking model. Later, Chen et al.11 extracted neural features and cross-entropy features for training an SVR model to build the ranking model. Etchegoyhen et al.12 employed a minimalist approach to develop a sentence-level ranking model; the authors call the method minimalist because it requires few resources and minimal deployment effort.
APPROACH
The sentence-level ranking of alternative
translations has been addressed as a typical
supervised machine learning problem, as shown
in Figure 1. The steps involved in implementing
this approach are described as: first, alternative
candidate translations corresponding to a single
source sentence from a given input data corpus
are generated at a time by inputting it to multiple
MT systems. Second, we develop a feature
extraction module that extracts a feature vector
by analyzing the source sentence, translation
outputs, and the translation process. Third, the
obtained feature vectors with manually anno-
tated quality ranks referred to as training instan-
ces are given to the ML algorithm for developing
the ranking model. Finally, this ranking model
predicts the ranks on an unseen test set.
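The four pipeline steps can be sketched with a toy one-feature linear regressor. The single length-ratio feature and the least-squares learner are stand-ins for illustration only (the article's system uses many more features and learners such as SVR), and all names here are our own:

```python
def extract_features(source, translation):
    """Step 2 (toy): a single surface feature, the target/source
    token-length ratio."""
    s, t = len(source.split()), len(translation.split())
    return t / max(s, 1)


def fit_linear(xs, ys):
    """Step 3 (toy): ordinary least-squares fit y = a*x + b over the
    training instances (assumes the xs are not all identical)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx


def predict(model, x):
    """Step 4: predict a continuous quality value for an unseen instance."""
    a, b = model
    return a * x + b
```

A richer system would replace `extract_features` with the full black-box and glass-box feature vector and `fit_linear` with one of the regression algorithms listed later in the article.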
DATASETS AND MT SYSTEM
We collected 5000 sentences from the tourism domain for the development of our proposed model on the English–Hindi language pair. The dataset is freely available on the "Technology Development for Indian Languages" website. Further, we split this dataset into training and test sets in the ratio of 80% for training and 20% for testing the ranking model.
Subsequently, we used three MT systems, namely Google, Bing, and the Moses phrase-based model,13 to
Figure 1. Ranking system architecture.
obtain the alternative candidate translations corresponding to a given source sentence. One example source sentence from the test set, for which the introduced translation systems generated alternative candidate translations alongside its reference translation, is given as follows.

Source Sentence: The government of India office has more information on other destinations as well.
QUALITY SCORE
The translated outputs corresponding to the given source text are annotated manually on a five-point scale using the human translation edit rate14 score to generate the training samples for the ranking model. The five-point scale used for annotating the quality of the translation outputs is described by the following scoring scheme:

1 – the translation is intelligible and perfect;
2 – the translation is generally intelligible and clear but requires minor correction;
3 – the translation needs significant editing effort to reach a publishable level;
4 – the translation contains various errors and mistranslations that require major correction;
5 – the translation is very poor.
FEATURE EXTRACTION
We used the feature extraction module to obtain a feature vector that indicates translation quality. This feature vector is represented mathematically in (2) and is extracted for every pair (S_i, T_i) of a source sentence and its translations, with i = 1, 2, 3, ..., n:

f(i) = G(S_i, T_i).   (2)

Here, G is a feature generation function that extracts the feature vector given a single source sentence and its corresponding translation outputs. Each feature vector f(i), obtained from the ith source sentence, together with its corresponding list of ranks r_i, defines a training instance as given in (3). Furthermore, a training set containing N instances is formulated as given in (4):

I(i) = (f_i, r_i)   (3)

T = {(f_i, r_i)}, i = 1, ..., N.   (4)

Finally, given a training set, the goal of the learning algorithm is to define a ranking function that minimizes the total error (5) between the actual and predicted rank lists, where the ranking function predicts a list of ranks r̂_i given a feature vector f(i):

sum_{i=1}^{m} error(r_i, r̂_i).   (5)
Mainly, in this article, we used two feature sets, namely black-box and glass-box, to extract features indicating translation output quality, using various linguistic analysis tools to analyze the source-language sentence, the alternative target translations, and aspects of the translation process. According to their origin, these feature sets are described as follows.
Black-Box Features
The black-box features are obtained by automatically analyzing both the source and target sentences. The black-box feature set is further categorized as follows.

Surface features: These features are simple and account for the difficulty of the translation task merely by analyzing the source and target sentences. They include the token (word) counts of both source and target sentences, the unknown-word count, the average character count per token, the sentence length, and the source-to-target length ratio.
Target LM-based scores: The LM is an indication of the fluency and plausibility of the target sentence, since it gives statistics about the correctness of word sequences in a specific language. This category mainly covers features such as the smoothed unigram, bigram, and trigram probabilities of the target-language sentence; the unigram, bigram, and trigram perplexities of the target sentences are also considered under this category.
IBM Model 1 scores: The IBM Model 115 score is based on a bag-of-words translation model, which measures the quality of association over all feasible alignment probabilities between the tokens of the source sentence and the target sentence. This category includes the scores in both directions.
Parsing-based features: This category uses features obtained from PCFG parsing16 of both the source and target sentences. These features cover more complex phenomena such as long-distance structures and grammatical fluency. PCFG parsing produces numerous possible parse trees for a given input sentence, resulting in an n-best list of parse candidates. These features include the count of n-best trees generated, the log-likelihood of the parse trees, and the confidence of the best parse tree.
Shallow grammatical match counts: To obtain adequacy features, similar or identical grammatical structures must occur in both the source sentence and the target translations. This category covers the occurrences of the basic node labels of the PCFG parse tree in both source and target sentences; in particular, it includes nouns, verbs, NPs, VPs, PPs, and subordinate clauses.
Source complexity-based features: These include features such as the average source-sentence token length, the average count of translations per token in the source sentence, the percentage of the source sentence's 1-grams, 2-grams, 3-grams, and 4-grams that fall in the lower and higher frequency quartiles of a corpus, and the percentage of the source sentence's 1-grams, 2-grams, 3-grams, and 4-grams present in a given corpus.
Contrastive scoring: Each target translation is scored with automatic evaluation metrics (such as METEOR17) as a feature, using the alternative translations as reference translations.18
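A couple of the black-box features above can be sketched in a few lines. This is a toy illustration: the `vocab` known-words set and the add-alpha bigram model are our own simplifications of how unknown words and the target LM scores might be computed:

```python
import math
from collections import Counter


def surface_features(source, target, vocab):
    """A few of the surface features listed above; `vocab` is an assumed
    known-words set used for the unknown-word count."""
    s_toks, t_toks = source.split(), target.split()
    return {
        "src_len": len(s_toks),
        "tgt_len": len(t_toks),
        "len_ratio": len(s_toks) / max(len(t_toks), 1),
        "unknown": sum(1 for w in s_toks if w.lower() not in vocab),
        "avg_chars": sum(map(len, t_toks)) / max(len(t_toks), 1),
    }


def bigram_logprob(sentence, corpus, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a target sentence
    against a small training corpus; a toy stand-in for the smoothed
    target LM scores described above."""
    tokens = [t for line in corpus for t in line.split()]
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab_size = len(unigrams)
    logp = 0.0
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        logp += math.log((bigrams[(a, b)] + alpha)
                         / (unigrams[a] + alpha * vocab_size))
    return logp
```

A fluent target sentence whose word sequences appear often in the LM corpus receives a log-probability closer to zero than a disfluent one.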
Glass-Box Features
This category of features generally relies on the internal workings of the MT system and describes the processes involved in generating the translation. They are also referred to as MT system features. These features are as follows.
Count of n-best translations corresponding to each source-language sentence.
Word posterior probability.
Costs obtained from Moses, such as the language model cost, distortion cost, weighted token penalty cost, and unweighted token penalty cost.
MT system output back-translation: each back-translated sentence is scored using BLEU, treating the source sentence as the translation reference.
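The back-translation feature above can be sketched with a toy unigram-precision stand-in for BLEU (a real system would use full BLEU with higher-order n-grams and a brevity penalty); the function name is our own:

```python
from collections import Counter


def unigram_precision(back_translation, source):
    """Score a back-translated sentence against the original source,
    treating the source as the reference: the fraction of hypothesis
    tokens matched (with clipping) in the reference."""
    ref = Counter(source.lower().split())
    hyp = back_translation.lower().split()
    if not hyp:
        return 0.0
    matched = 0
    for w in hyp:
        if ref[w] > 0:      # clip: each reference token matches once
            ref[w] -= 1
            matched += 1
    return matched / len(hyp)


# A back-translation that round-trips faithfully scores near 1.0.
print(unigram_precision("the cat sat", "the cat sat"))  # -> 1.0
```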
LEARNING METHODS
To develop the ranking model for the sentence-level translation quality ranking task, we rely on several powerful supervised machine-learning regression algorithms. In other words, treating ranking as a regression task, we build models that assign a continuous value as the quality measure of a sentence. Mainly, we aim to develop ranking models that automatically predict the rank (a continuous value) in a range, say [1:5], in a similar way as humans do manually. In this context, some effective regression algorithms that have been used in the past and proved successful for other researchers are given as follows:
Partial least squares regression;
Linear regression;
Lasso;
SVR; and
M5P.
EVALUATION MEASURES
The developed ranking model is evaluated by computing the prediction errors between the actual and predicted values, and the correlations between the model-predicted ranks and the manually annotated ranks. In particular, for prediction error evaluation, we computed the mean absolute error (MAE) and the root mean squared error (RMSE), where, for a given instance i, y_i denotes the actual target value and ŷ_i the value estimated by the ranking model. These evaluation measures, used to evaluate the performance of the learning models, are briefly described below.
MAE: The MAE (6) measures the average magnitude of the errors in a set of predictions:

MAE = (1/n) sum_{i=1}^{n} |ŷ_i - y_i|.   (6)
RMSE: The RMSE (7) is a quadratic scoring rule used to measure the average magnitude of the error:

RMSE = sqrt( (1/n) sum_{i=1}^{n} (ŷ_i - y_i)^2 ).   (7)
Kendall tau rank correlation: This ranking evaluation measure (8), with n representing the size of each sample, C the number of concordant pairs (pairs whose rankings agree), and D the number of discordant pairs (pairs whose rankings disagree), while pairs with equal rankings count as neither concordant nor discordant, is defined as

tau = (C - D) / ((1/2) n (n - 1)).   (8)
Spearman's rank correlation: Spearman's rank correlation (9), with N representing the count of alternative translations and d_i the difference between the ranks assigned to a translated sentence by the two rankings, is given by

rho = 1 - (6 sum_{i=1}^{N} d_i^2) / (N (N^2 - 1)).   (9)
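The four measures above can be written as a short pure-Python sketch; the correlation formulas below assume tie-free rank lists, as in the simple forms of (8) and (9):

```python
import math


def mae(actual, predicted):
    """Mean absolute error, as in Eq. (6)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, as in Eq. (7)."""
    return math.sqrt(sum((p - a) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))


def kendall_tau(r1, r2):
    """Kendall's tau, as in Eq. (8): concordant minus discordant pairs
    over all n*(n-1)/2 pairs; tied pairs count as neither."""
    n = len(r1)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (r1[i] - r1[j]) * (r2[i] - r2[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)


def spearman_rho(r1, r2):
    """Spearman's rank correlation, as in Eq. (9), for tie-free ranks."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical rank lists give tau = rho = 1; fully reversed lists give -1.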
RESULTS
In this section, we present the experimental results of our proposed ranking model. In particular, Table 1 shows the scores obtained by our developed ranking model using the SVR regression algorithm. From these results, we see that the Google translator obtains the best quality scores, whereas the Moses phrase-based translator obtains the worst; the former is therefore ranked first (best) and the latter third (worst).
CONCLUSION
This article addressed the challenging task of automatically ranking translation outputs to predict translation quality. The problem is addressed as a supervised ML task using a regression algorithm built on several features. The correlations with the manual judgments (rankings) show success in developing a mechanism for ranking translations by their quality. Finally, the performance of the followed mechanism is significant and remarkably high, without any access to gold reference translations.
&REFERENCES
1. E. Westfall, “Legal implications of MT on-line,” in Proc. 2nd AMTA Conf., 1996, pp. 231–232.
2. K. W. Church and E. H. Hovy, “Good applications for
crummy machine translation,” Mach. Transl., vol. 8,
no. 4, pp. 239–258, 1993.
3. C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz, and
J. Schroeder, “(Meta-) evaluation of machine
translation,” in Proc. 2nd Workshop Statist. Mach.
Transl., 2007, pp. 136–158.
4. Y. Ye, M. Zhou, and C. Y. Lin, “Sentence level machine
translation evaluation as a ranking problem: one step
aside from BLEU,” in Proc. 2nd Workshop Statist.
Mach. Transl., 2007, pp. 240–247.
5. K. Duh, “Ranking vs. regression in machine translation evaluation,” in Proc. 3rd Workshop Statist. Mach. Transl., 2008, pp. 191–194.
6. A. V. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr, “Combining outputs from multiple machine translation systems,” in Proc. Main Conf. Human Lang. Technol., Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2007, pp. 228–235.
Table 1. Performance evaluation of the developed ranking model.

Translation systems    MAE     RMSE    Spearman's correlation    Kendall's Tau    Rank
Google                 0.0698  0.1140  0.7881                    0.6919           1
Bing                   0.0759  0.1263  0.7554                    0.6617           2
Moses-Phrase based     0.1507  0.2095  0.4255                    0.3088           3
7. L. Specia, D. Raj, and M. Turchi, “Machine translation
evaluation versus quality estimation,” Mach. Transl.,
vol. 24, no. 1, pp. 39–50, 2010.
8. R. Soricut and S. Narsale, “Combining quality
prediction and system selection for improved
automatic translation output,” in Proc. 7th Workshop
Statist. Mach. Transl., 2012, pp. 163–170.
9. E. Avramidis and M. Popovic, “Machine learning
methods for comparative and time-oriented quality
estimation of machine translation output,” in
Proc. 8th Workshop Statist. Mach. Transl., 2013,
pp. 329–336.
10. A. Tezcan, V. Hoste, B. Desmet, and L. Macken,
“UGENT-LT3 SCATE system for machine translation
quality estimation,” in Proc. 10th Workshop Statist.
Mach. Transl., 2015, pp. 353–360.
11. Z. Chen et al., “Improving machine translation quality
estimation with neural network features,” in Proc. 2nd
Conf. Mach. Transl., 2017, pp. 551–555.
12. T. Etchegoyhen, E. M. Garcia, and A. Azpeitia,
“Supervised and unsupervised minimalist quality
estimators: Vicomtech’s participation in the WMT 2018
quality estimation task,” in Proc. 3rd Conf. Mach.
Transl., Shared Task Papers, 2018, pp. 782–787.
13. P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-
based translation,” in Proc. Conf. North Amer. Chapter
Assoc. Comput. Linguistics Human Lang. Technol.,
2003, vol. 1, pp. 48–54.
14. M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and
J. Makhoul, “A study of translation edit rate with
targeted human annotation,” in Proc. 7th Conf. Assoc.
Mach. Transl. Amer., 2006, vol. 200, no. 6,
pp. 223–231.
15. P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and
R. L. Mercer, “The mathematics of statistical machine
translation: Parameter estimation,” Comput.
Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
16. S. Petrov, L. Barrett, R. Thibaux, and D. Klein,
“Learning accurate, compact, and interpretable tree
annotation,” in Proc. 21st Int. Conf. Comput.
Linguistics 44th Annu. Meeting Assoc. Comput.
Linguistics, 2006, pp. 433–440.
17. S. Banerjee and A. Lavie, “METEOR: An automatic
metric for MT evaluation with improved correlation with
human judgments,” in Proc. ACL Workshop Intrinsic
Extrinsic Eval. Measures Mach. Transl. Summarization,
2005, pp. 65–72.
18. R. Soricut, N. Bach, and Z. Wang, “The SDL language
weaver systems in the WMT12 quality estimation
shared task,” in Proc. 7th Workshop Statist. Mach.
Transl., 2012, pp. 145–151.
Nivedita Bharti is currently a Full-Time Research Scholar with Banasthali Vidyapith, Vanasthali, India. Her research interests include the development of models and methods for quality estimation of MT systems for Indian languages, as well as natural language processing, machine translation, machine learning, and deep learning. She received the M.Tech. degree in computer science. Contact her at nivedita2bharti@gmail.com.
Nisheeth Joshi is currently an Associate Profes-
sor with the Department of Computer Science,
Banasthali Vidyapith, Vanasthali, India. He primarily
works in the area of machine translation, information
retrieval, and cognitive computing. He has more than
12 years of teaching experience. He received the
Ph.D. degree in computer science and engineering
with specialization in evaluation of machine transla-
tion. He is a Life Member of the Computer Society
of India and the Institution of Electronics and
Telecommunications Engineers, India. Contact him
at jnisheeth@banasthali.in.
Iti Mathur is currently an Associate Professor with
the Department of Computer Science, Banasthali
Vidyapith, Vanasthali, India. She primarily works in
the field of information retrieval, ontology engineer-
ing, and machine translation. She has more than 15
years of experience in teaching and research. She
received the Ph.D. degree in computer science with
specialization in the area of ontologies. She is a Life
Member of the Computer Society of India. Contact
her at mathur_iti@rediffmail.com.
Pragya Katyayan is currently a Full-Time
Research Scholar with Banasthali Vidyapith, Vanas-
thali, India. Her research interest lies in the area of
machine translation, natural language processing,
information retrieval, and deep learning. Before join-
ing the Ph.D. programme, she received the Master of
Science degree in computer science and has
worked as a consultant on various projects based on
natural language processing. She is a Student Mem-
ber of the Computer Society of India. Contact her at
pragya.katyayan@outlook.com.