Quality-Based Ranking of Translation Outputs

Nivedita Bharti, Nisheeth Joshi, Iti Mathur, and Pragya Katyayan
Banasthali Vidyapith

Theme Article: Artificial Intelligence
Digital Object Identifier 10.1109/MITP.2020.2976009
Date of current version: 17 July 2020
Abstract—Translation ranking is of great significance for machine translation (MT): it allows the performance of multiple MT systems to be compared and supports efficient MT training. This article demonstrates a mechanism for ranking the translation outputs generated by MT systems from best to worst. To implement this approach, the system exploits a supervised learning algorithm trained on existing manual rankings, using various features obtained from linguistic analysis of both source- and target-side sentences, without relying on reference translations.
IN RECENT YEARS, MT systems have improved significantly. However, the quality of the translations generated by these systems is neither consistent nor perfect across multiple unseen test sentences. For this reason, researchers have increasingly focused on developing techniques for estimating the quality of translated text and procuring indications of translation performance in a real-time translation environment without any human intervention (i.e., without access to the reference (correct) translations). This article focuses on solving the challenging task of automatically ranking, from best to worst, the alternative translation outputs obtained from several MT systems for a given source sentence. This ranking task, commonly performed manually by human judges (annotators), is an acknowledged practice for evaluating translation outputs.
Additionally, when several MT systems are used in combination, some systems may translate a given input source sentence perfectly while others may not translate it well. In such cases, selecting the best translation can boost overall performance. To deal with these issues, we develop a translation ranking system that exploits machine learning (ML) techniques to imitate human ranking behavior. In detail, this automatic ranking system orders the various translation outputs generated for a given source sentence according to their comparative quality.
The framework is developed using a regression algorithm trained on previously hand-annotated ranks together with numerous qualitative criteria computed on both the source and target sentences.
RANKING PROBLEM DESCRIPTION
This work focuses on developing a system that ranks alternative translation outputs in the same way a human would. The ranking system is given multiple translation outputs for a single source sentence, generated by multiple MT systems. The main goal of the task is to rank all of the generated translations according to their translation quality, as judged by numerous qualitative measures computed over the outputs. Mathematically, we describe a ranking system as the three-tuple in (1):
R_{system} = \{S, T, R\}   (1)

where S is a given source sentence, T = \{t_1, t_2, t_3, \ldots, t_n\} is the corresponding set of translations, t_k is the kth translation of the source sentence S, and n is the total count of generated translations.
Every individual translation in the set T is associated with an ordinal list of ranks (judgments) represented by R = \{r_1, r_2, r_3, \ldots, r_m\}, where r_k is the rank given to translation t_k relative to the other alternative translations in T. It is therefore clear that this qualitative ranking mechanism does not indicate any generic or absolute quality measure. Because translation ranking is done at the sentence level, the underlying mechanism focuses on a single sentence at a time, considers its alternative translations, and then decides on their relative quality. Thus, an annotated rank carries meaning only for the sentence under consideration and its generated translations. In particular, every source sentence S_j is associated with a translation set T_j = (t_1^j, t_2^j, t_3^j, \ldots, t_n^j), where t_i^j is the ith translation of the jth sentence and n is the translation count. Each translation list is in turn associated with a list of relative rankings R_j = (r_1^j, r_2^j, r_3^j, \ldots, r_m^j), where r_k^j is the rank of the kth translation of the jth source sentence. Finally, an important point to note is that if the quality of two translation outputs is similar, the case is called a tie, and the same rank is assigned to both translation candidates.
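As a small illustration of this tie convention, the sketch below converts per-translation quality scores into ranks, giving tied candidates the same rank. The use of scipy's rankdata is our illustrative choice, not something prescribed by the article.

```python
# Illustrative: turn predicted quality scores into 1-based ranks.
# Lower score = better translation; equal scores share the same rank.
from scipy.stats import rankdata

def scores_to_ranks(scores):
    """Rank translations by score, assigning tied candidates the same
    (minimum) rank, matching the tie convention described above."""
    return rankdata(scores, method="min").astype(int).tolist()

# Three alternative translations of one source sentence; two of them tie.
print(scores_to_ranks([1.2, 1.2, 3.7]))  # -> [1, 1, 3]
```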
NEED FOR A RANKING SYSTEM

A ranking system is needed because the translation industry requires more transparency regarding MT systems' strengths and weaknesses. The professional community has increasingly focused on discussing the services provided by MT systems and their effects on commercial software. Some of the important needs are as follows.

Ranking contributes to the definition of translation quality: Enhancing translation quality is essential for providers to avoid potential lawsuits, since a flaw in a translation can ultimately lead to a safety violation [1].

Ranking gives a set of criteria and a rationale for evaluating translation systems: According to Church and Hovy, it should be interpretable what an MT system can and cannot do [2], particularly when translation services are used by the general public and applied at large scale.

Ranking responds to consumers' demands for easily accountable information about translation system quality: Ranking the accuracy (usefulness) of translation systems can help end-users decide which translation system suits their purpose.
RELATED WORK

The concept of translation ranking has been employed in various MT-related tasks. The ranking concept was first used for evaluating machine translation in the WMT07 evaluation task [3]. In this direction, several earlier contributions based on ML approaches proposed training on rank data [4], [5]. However, these approaches used reference translations and were evaluated only by obtaining an overall corpus-level ranking.

In contrast, and closer to our reference-free approach, Rosti et al. [6] performed translation selection using generalized models at the sentence level; in particular, they re-ranked N-best lists combined from multiple translation systems to comparatively rank the translation outputs. Specia et al. [7] developed one translation quality prediction model per translation system; the scores predicted by these individual models were then used to rank the alternative candidate translations of the same source-language sentence.

Later, Soricut and Narsale [8] employed ML to rank multiple alternative translations and selected the highest-ranked translation output. Subsequently, Avramidis and Popovic [9] ranked alternative translation outputs using logistic regression as pairwise classifiers with black-box features. Tezcan et al. [10] trained a regression model on baseline features combined with word-level predictions as features to develop a sentence-level ranking model. Chen et al. [11] extracted neural features and cross-entropy features to train an SVR model for ranking. Finally, Etchegoyhen et al. [12] employed a minimalist approach to developing a sentence-level ranking model; the authors called their method minimalist because it required few resources and minimal deployment effort.
APPROACH

The sentence-level ranking of alternative translations is addressed as a typical supervised machine learning problem, as shown in Figure 1. The steps involved in implementing this approach are as follows. First, the alternative candidate translations for a single source sentence from the input corpus are generated by feeding the sentence to multiple MT systems. Second, a feature extraction module extracts a feature vector by analyzing the source sentence, the translation outputs, and the translation process. Third, the feature vectors, paired with manually annotated quality ranks and referred to as training instances, are given to the ML algorithm to build the ranking model. Finally, the ranking model predicts ranks on an unseen test set. A minimal sketch of this pipeline follows.
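The sketch below shows the shape of this pipeline under stated assumptions: a hypothetical extract_features() stands in for the feature extraction module described later, and SVR (one of the regressors used in this article) serves as the learner. It is an illustration, not the authors' exact implementation.

```python
# Minimal pipeline sketch: train a regressor on (feature vector, rank)
# pairs, then rank unseen candidate translations by predicted score.
import numpy as np
from sklearn.svm import SVR

def train_ranker(train_pairs, train_ranks, extract_features):
    # train_pairs: list of (source, translation) string pairs;
    # train_ranks: manually annotated ranks, one per pair.
    X = np.array([extract_features(src, hyp) for src, hyp in train_pairs])
    model = SVR(kernel="rbf")
    model.fit(X, np.array(train_ranks, dtype=float))
    return model

def rank_candidates(model, source, candidates, extract_features):
    X = np.array([extract_features(source, hyp) for hyp in candidates])
    scores = model.predict(X)
    # On the 1-5 scale used here, a lower predicted score means a better
    # translation, so sort ascending.
    return [candidates[i] for i in np.argsort(scores)]
```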
DATASETS AND MT SYSTEM

We collected 5000 sentences from the tourism domain to develop our proposed model on the English–Hindi language pair. The dataset is freely available on the "Technology Development for Indian Languages" website. We split this dataset into a training set (80%) and a test set (20%) for the ranking model.
Figure 1. Ranking system architecture.

Subsequently, we used three MT systems, Google, Bing, and a Moses phrase-based model [13], to obtain the alternative candidate translations for a given source sentence. An example source sentence from the test set, for which each of the three systems produced an alternative candidate translation, is given below.

Source Sentence: The government of India office has more information on other destinations as well.
QUALITY SCORE

To generate training samples for the ranking model, the translated outputs for the given source text are manually annotated on a five-point scale using the human translation edit rate (HTER) [14] score. The five-point scale used for annotating the quality of the translation outputs is as follows (a hypothetical mapping from HTER values to this scale is sketched after the list):

1 – the translation is intelligible and perfect;
2 – the translation is generally intelligible and clear but requires minor correction;
3 – the translation needs significant editing effort to reach publishable quality;
4 – the translation contains various errors and mistranslations that require major correction;
5 – the translation is very poor.
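Since the article does not specify how HTER values were bucketed into this scale, the thresholds below are purely illustrative assumptions made to render the scheme concrete.

```python
def hter_to_scale(hter):
    """Map an HTER value in [0, 1] (fraction of edits needed) to the
    five-point scale above. All thresholds are hypothetical."""
    if hter == 0.0:
        return 1  # intelligible and perfect
    if hter <= 0.1:
        return 2  # minor correction needed
    if hter <= 0.3:
        return 3  # significant editing needed
    if hter <= 0.5:
        return 4  # major correction needed
    return 5      # very poor
```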
FEATURE EXTRACTION

We use the feature extraction module to obtain a feature vector that indicates translation quality. This feature vector, represented mathematically in (2), is extracted for every pair (S_i, T_i) of a source sentence and its translations, with i = 1, 2, 3, ..., n:

f(i) = G(S_i, T_i).   (2)

Here, G is a feature generation function that extracts the feature vector given a single source sentence and its corresponding translation outputs. Each feature vector f(i) obtained from the ith source sentence, together with the corresponding list of ranks, defines a training instance as given in (3), and a training example set containing N instances is formulated as given in (4):

I(i) = (f(i), r_i)   (3)

T = \{(f(i), r_i)\}_{i=1}^{N}.   (4)

Finally, given a training example set, the goal of the learning algorithm is to define a ranking function that predicts a list of ranks \hat{r}_i for a feature vector f(i) while minimizing the total error between the predicted and annotated ranks, as given in (5):

\sum_{i=1}^{m} error(r_i, \hat{r}_i).   (5)
In this article, we use two feature sets, black-box and glass-box, to extract features indicating translation output quality, applying various linguistic analysis tools to the source-language sentence, the alternative target translations, and aspects of the translation process. According to their origin, these feature sets are described as follows.
Black-Box Features

The black-box features are obtained by automatically analyzing both the source and target sentences. The black-box feature set is further categorized as follows.

Surface features: These simple features account for the difficulty of the translation task by merely analyzing the source and target sentences. They include the token (word) counts of the source and target sentences, the unknown-word count, the average character count per token, the sentence length, and the source-to-target length ratio. A sketch of such features appears below.
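This minimal sketch computes a few of the surface features just listed; whitespace tokenization and the notion of a fixed vocabulary are simplifying assumptions (a real system would use proper tokenizers for English and Hindi).

```python
def surface_features(source, target, vocabulary):
    """Simple surface indicators of translation difficulty/quality.
    `vocabulary` is a set of known target-language tokens (assumption:
    unknown words are counted on the target side)."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return {
        "src_token_count": len(src_tokens),
        "tgt_token_count": len(tgt_tokens),
        "unknown_token_count": sum(1 for t in tgt_tokens if t not in vocabulary),
        "avg_chars_per_tgt_token": sum(map(len, tgt_tokens)) / max(len(tgt_tokens), 1),
        "src_to_tgt_length_ratio": len(src_tokens) / max(len(tgt_tokens), 1),
    }
```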
Target LM-based scores: A language model (LM) indicates the fluency and plausibility of the target sentence, since it provides statistics about the correctness of word sequences in a specific language. This category mainly covers features such as the smoothed unigram, bigram, and trigram probabilities of the target-language sentence, as well as its unigram, bigram, and trigram perplexities (a simple smoothed-LM sketch follows).
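As a hedged illustration of such features, the sketch below scores a target sentence with an add-one smoothed bigram model built from corpus counts; the article does not specify its actual smoothing method.

```python
import math
from collections import Counter

def bigram_lm_features(tokens, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability and perplexity of a
    target sentence. `unigrams` and `bigrams` are Counters collected
    from a monolingual target-language corpus."""
    logprob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log(p)
    n_events = max(len(tokens) - 1, 1)
    perplexity = math.exp(-logprob / n_events)
    return {"bigram_logprob": logprob, "bigram_perplexity": perplexity}
```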
IBM Model 1 scores: The IBM Model 1 [15] score is based on a bag-of-words translation model and measures the strength of association over all feasible alignment probabilities between the tokens of the source sentence and those of the target sentence. This category includes the scores in both translation directions (see the sketch below).
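The following sketch computes an IBM Model 1 style score, assuming a precomputed lexical translation table (e.g., obtained by EM training on a parallel corpus); the table and its format are illustrative assumptions.

```python
import math

def ibm1_score(src_tokens, tgt_tokens, lex_table, epsilon=1e-9):
    """Average log-probability of the target tokens under a bag-of-words
    model: each target token's probability is summed over all possible
    source-token alignments. `lex_table` maps (src_word, tgt_word) to a
    lexical translation probability."""
    logprob = 0.0
    for tgt in tgt_tokens:
        p = sum(lex_table.get((src, tgt), 0.0) for src in src_tokens)
        logprob += math.log(max(p / len(src_tokens), epsilon))
    return logprob / max(len(tgt_tokens), 1)
```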
Parsing-based features: This category uses features obtained from PCFG parsing [16] of both the source and target sentences. These features cover more complex phenomena such as long-distance structures and grammatical fluency. A PCFG parser produces numerous possible parse trees for a given input sentence, yielding an n-best list of parse candidates. Features of this kind include the count of n-best trees generated, the log-likelihood of the parse trees, and the confidence of the best parse tree.

Shallow grammatical match counts: To capture adequacy, similar or identical grammatical structures should occur in both the source sentence and the target translations. This category counts the occurrences of basic PCFG parse-tree node labels in both the source and target sentences, particularly nouns, verbs, NPs, VPs, PPs, and subordinate clauses (see the sketch after this item).
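Assuming parse trees are available as bracketed strings (e.g., from a PCFG parser), the sketch below counts the node labels of interest and measures their overlap between source and target. The Penn Treebank style labels are an assumption; a Hindi parser would use its own tagset.

```python
from collections import Counter
from nltk import Tree

PHRASE_LABELS = {"NP", "VP", "PP", "SBAR"}

def label_counts(tree_str):
    """Count coarse node labels (nouns, verbs, NPs, VPs, PPs, SBARs)
    in a bracketed parse-tree string."""
    counts = Counter()
    for subtree in Tree.fromstring(tree_str).subtrees():
        label = subtree.label()
        if label.startswith("NN"):
            counts["NOUN"] += 1
        elif label.startswith("VB"):
            counts["VERB"] += 1
        elif label in PHRASE_LABELS:
            counts[label] += 1
    return counts

def shallow_match_counts(src_tree_str, tgt_tree_str):
    """Per-label overlap between source and target parse structures."""
    src, tgt = label_counts(src_tree_str), label_counts(tgt_tree_str)
    return {lab: min(src[lab], tgt[lab]) for lab in set(src) | set(tgt)}
```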
Source complexity features: These include the average token length of the source sentence, the average number of translations per source token, the percentage of source-sentence 1-grams, 2-grams, 3-grams, and 4-grams falling in the lower and higher frequency quartiles of a corpus, and the percentage of source-sentence 1-grams, 2-grams, 3-grams, and 4-grams present in a given corpus.

Contrastive scoring: Each target translation is scored with automatic evaluation metrics (such as METEOR [17]) as a feature, using the alternative translations as reference translations [18].
Glass-Box Features

This category of features relies on the internal workings of the MT system and describes the processes involved in generating the translation; these are also referred to as MT system features. They are as follows:

- the count of n-best candidates for each source-language sentence;
- word posterior probabilities;
- costs obtained from Moses, such as the language model cost, distortion cost, weighted token penalty cost, and unweighted token penalty cost;
- back-translation of the MT system output, where each back-translated sentence is scored with BLEU, treating the source sentence as the translation reference (a sketch follows).
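The back-translation feature can be sketched as follows; back_translate() is a placeholder for whichever reverse-direction MT system is used, and NLTK's sentence-level BLEU with smoothing is our illustrative metric choice.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def back_translation_bleu(source, mt_output, back_translate):
    """Score a back-translated sentence against the original source,
    treating the source as the reference translation."""
    back = back_translate(mt_output)  # target language -> source language
    smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
    return sentence_bleu([source.split()], back.split(),
                         smoothing_function=smooth)
```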
LEARNING METHODS

To develop the ranking model for the sentence-level translation quality ranking task, we rely on several powerful supervised regression algorithms. In other words, treating translation ranking as a regression task, we build models that assign a continuous value as the quality measure of a sentence. Specifically, we aim to develop ranking models that automatically predict the rank (a continuous value) in a range such as [1:5], much as a human annotator does manually. The regression algorithms used here, which have proved effective for other researchers in the past, are as follows (a comparison sketch appears after the list):

- partial least squares regression;
- linear regression;
- Lasso;
- SVR; and
- M5P.
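A hedged sketch of comparing several of these regressors with scikit-learn follows; note that M5P is a Weka algorithm with no scikit-learn equivalent, and the hyperparameters shown are illustrative defaults, not the article's settings.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

MODELS = {
    "PLS": PLSRegression(n_components=2),  # must not exceed the feature count
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "SVR": SVR(kernel="rbf"),
}

def compare_regressors(X_train, y_train, X_test, y_test):
    """Fit each candidate regressor and report its test-set MAE."""
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        preds = np.asarray(model.predict(X_test)).ravel()
        print(f"{name}: MAE = {mean_absolute_error(y_test, preds):.4f}")
```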
EVALUATION MEASURES

The developed ranking model is evaluated by computing the prediction errors between actual and predicted values, and the correlations between the model-predicted ranks and the manually annotated ranks. For the prediction errors, we compute the mean absolute error (MAE) and the root mean squared error (RMSE), where for a given instance y_i denotes the actual target value and \hat{y}_i the value estimated by the ranking model. These evaluation measures are briefly described below, followed by a short computational sketch.

MAE: MAE (6) measures the average magnitude of the errors in a set of predictions:

MAE = \frac{\sum_{i=1}^{n} |\hat{y}_i - y_i|}{n}.   (6)

RMSE: RMSE (7) is a quadratic scoring rule that measures the average magnitude of the error:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}.   (7)

Kendall tau rank correlation: This ranking evaluation measure (8), with n the size of each sample, C the number of concordant pairs (whose rankings agree), and D the number of discordant pairs (whose rankings disagree), pairs with equal rankings being neither concordant nor discordant, is defined as

\tau = \frac{C - D}{\frac{1}{2} n (n - 1)}.   (8)

Spearman's rank correlation: Spearman's rank correlation (9), with N the count of alternative translations and d_i the difference between the ranks assigned to a translated sentence by the two rankings, is given by

\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}.   (9)
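These four measures can be computed directly with numpy and scipy, as sketched below; note that scipy's kendalltau computes the tie-adjusted tau-b, which coincides with (8) only when there are no ties.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def evaluate(y_true, y_pred):
    """MAE (6), RMSE (7), Kendall tau (8), and Spearman rho (9) between
    annotated and predicted ranks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    tau, _ = kendalltau(y_true, y_pred)
    rho, _ = spearmanr(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "Kendall tau": tau, "Spearman rho": rho}
```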
RESULTS

This section presents the experimental results of our proposed ranking model. In particular, Table 1 reports the performance of the developed ranking model using the SVR regression algorithm. From these results, Google's translations receive the best quality estimates (lowest prediction errors and highest correlations) and the Moses phrase-based system's the worst, so the former is ranked first (best) and the latter third (worst).

Table 1. Performance evaluation of the developed ranking model.

Translation system    MAE     RMSE    Spearman's correlation  Kendall's tau  Rank
Google                0.0698  0.1140  0.7881                  0.6919         1
Bing                  0.0759  0.1263  0.7554                  0.6617         2
Moses phrase-based    0.1507  0.2095  0.4255                  0.3088         3
CONCLUSION

This article addressed the challenging task of automatically ranking translation outputs in order to predict translation quality. The problem was treated as a supervised ML task using a regression algorithm built on several features. Correlations with the manual judgments (rankings) show that the developed mechanism successfully ranks translations by quality. Finally, the performance of the proposed mechanism is remarkably high, despite having no access to gold reference translations.
REFERENCES

1. E. Westfall, "Legal implications of MT on-line," in Proc. 2nd AMTA Conf., 1996, pp. 231–232.
2. K. W. Church and E. H. Hovy, “Good applications for
crummy machine translation,” Mach. Transl., vol. 8,
no. 4, pp. 239–258, 1993.
3. C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz, and
J. Schroeder, “(Meta-) evaluation of machine
translation,” in Proc. 2nd Workshop Statist. Mach.
Transl., 2007, pp. 136–158.
4. Y. Ye, M. Zhou, and C. Y. Lin, “Sentence level machine
translation evaluation as a ranking problem: one step
aside from BLEU,” in Proc. 2nd Workshop Statist.
Mach. Transl., 2007, pp. 240–247.
5. K. Duh, "Ranking vs. regression in machine translation evaluation," in Proc. 3rd Workshop Statist. Mach. Transl., 2008, pp. 191–194.
6. A. V. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr, "Combining outputs from multiple machine translation systems," in Proc. Main Conf. Human Lang. Technol. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2007, pp. 228–235.
7. L. Specia, D. Raj, and M. Turchi, “Machine translation
evaluation versus quality estimation,” Mach. Transl.,
vol. 24, no. 1, pp. 39–50, 2010.
8. R. Soricut and S. Narsale, “Combining quality
prediction and system selection for improved
automatic translation output,” in Proc. 7th Workshop
Statist. Mach. Transl., 2012, pp. 163–170.
9. E. Avramidis and M. Popovic, “Machine learning
methods for comparative and time-oriented quality
estimation of machine translation output,” in
Proc. 8th Workshop Statist. Mach. Transl., 2013,
pp. 329–336.
10. A. Tezcan, V. Hoste, B. Desmet, and L. Macken,
“UGENT-LT3 SCATE system for machine translation
quality estimation,” in Proc. 10th Workshop Statist.
Mach. Transl., 2015, pp. 353–360.
11. Z. Chen et al., “Improving machine translation quality
estimation with neural network features,” in Proc. 2nd
Conf. Mach. Transl., 2017, pp. 551–555.
12. T. Etchegoyhen, E. M. Garcia, and A. Azpeitia,
“Supervised and unsupervised minimalist quality
estimators: Vicomtech’s participation in the WMT 2018
quality estimation task,” in Proc. 3rd Conf. Mach.
Transl., Shared Task Papers, 2018, pp. 782–787.
13. P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-
based translation,” in Proc. Conf. North Amer. Chapter
Assoc. Comput. Linguistics Human Lang. Technol.,
2003, vol. 1, pp. 48–54.
14. M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and
J. Makhoul, “A study of translation edit rate with
targeted human annotation,” in Proc. 7th Conf. Assoc.
Mach. Transl. Amer., 2006, vol. 200, no. 6,
pp. 223–231.
15. P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and
R. L. Mercer, “The mathematics of statistical machine
translation: Parameter estimation,” Comput.
Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
16. S. Petrov, L. Barrett, R. Thibaux, and D. Klein,
“Learning accurate, compact, and interpretable tree
annotation,” in Proc. 21st Int. Conf. Comput.
Linguistics 44th Annu. Meeting Assoc. Comput.
Linguistics, 2006, pp. 433–440.
17. S. Banerjee and A. Lavie, “METEOR: An automatic
metric for MT evaluation with improved correlation with
human judgments,” in Proc. ACL Workshop Intrinsic
Extrinsic Eval. Measures Mach. Transl. Summarization,
2005, pp. 65–72.
18. R. Soricut, N. Bach, and Z. Wang, “The SDL language
weaver systems in the WMT12 quality estimation
shared task,” in Proc. 7th Workshop Statist. Mach.
Transl., 2012, pp. 145–151.
Nivedita Bharti is currently a Full-Time Research Scholar with Banasthali Vidyapith, Vanasthali, India. Her research interests include the development of models and methods for quality estimation of MT systems for Indian languages, as well as natural language processing, machine translation, machine learning, and deep learning. She received the M.Tech. degree in computer science. Contact her at nivedita2bharti@gmail.com.
Nisheeth Joshi is currently an Associate Profes-
sor with the Department of Computer Science,
Banasthali Vidyapith, Vanasthali, India. He primarily
works in the area of machine translation, information
retrieval, and cognitive computing. He has more than
12 years of teaching experience. He received the
Ph.D. degree in computer science and engineering
with specialization in evaluation of machine transla-
tion. He is a Life Member of the Computer Society
of India and the Institution of Electronics and
Telecommunications Engineers, India. Contact him
at jnisheeth@banasthali.in.
Iti Mathur is currently an Associate Professor with
the Department of Computer Science, Banasthali
Vidyapith, Vanasthali, India. She primarily works in
the field of information retrieval, ontology engineer-
ing, and machine translation. She has more than 15
years of experience in teaching and research. She
received the Ph.D. degree in computer science with
specialization in the area of ontologies. She is a Life
Member of the Computer Society of India. Contact
her at mathur_iti@rediffmail.com.
Pragya Katyayan is currently a Full-Time
Research Scholar with Banasthali Vidyapith, Vanas-
thali, India. Her research interest lies in the area of
machine translation, natural language processing,
information retrieval, and deep learning. Before join-
ing the Ph.D. programme, she received the Master of
Science degree in computer science and has
worked as a consultant on various projects based on
natural language processing. She is a Student Mem-
ber of the Computer Society of India. Contact her at
pragya.katyayan@outlook.com.