Quality-Based Ranking of Translation Outputs
Nivedita Bharti, Nisheeth Joshi, Iti Mathur,
and Pragya Katyayan
Banasthali Vidyapith
Abstract—Translation ranking is of great significance for machine translation (MT): it allows the performance of multiple MT systems to be compared and supports their efficient training. This article demonstrates a mechanism for ranking the translation outputs generated by MT systems from best to worst. To implement this approach, the system trains a supervised learning algorithm on existing manual rankings, using various features obtained from the linguistic analysis of both source- and target-side sentences, without relying on reference translations.
In recent years, MT systems have achieved significant improvements. However, the quality of the output generated by translation systems is neither consistent nor perfect across multiple unseen test sentences. For this reason, researchers have been focusing on developing techniques for estimating the quality of translated text and obtaining indications of translation performance in a real-time translation environment without any human intervention (i.e., without access to the reference (correct) translations). This article focuses on solving the challenging task of automatically ranking, from best to worst, the alternative translation outputs obtained from several MT systems for a given source sentence. This ranking task, commonly performed manually by human judges (annotators), is an acknowledged practice for evaluating translation outputs.
Additionally, when several MT systems are used in combination, some systems may translate a given input source sentence perfectly (correctly) while others may not translate it well. In such a case, selecting the best translation can boost performance. Therefore, to deal with these kinds of issues, we develop a translation ranking system that exploits machine learning (ML) techniques to imitate human behavior. In detail, this automatic translation ranking system can rank the various translation outputs generated for a given source sentence according to their comparative quality. This framework is
Digital Object Identifier 10.1109/MITP.2020.2976009
Date of current version 17 July 2020.
Theme Article: Artificial Intelligence
July/August 2020 Published by the IEEE Computer Society 1520-9202 ß2020 IEEE 21
Authorized licensed use limited to: University of Canberra. Downloaded on July 19,2020 at 14:13:33 UTC from IEEE Xplore. Restrictions apply.
developed using a regression algorithm trained on previously manually annotated ranks together with numerous qualitative criteria computed on both the source and target sentences.
RANKING PROBLEM DESCRIPTION
This article focuses on developing a ranking system that ranks alternative translation outputs as a human would. In particular, the ranking system is given multiple translation outputs corresponding to a given source sentence; these alternative translations have been generated by multiple MT systems. In other words, the main goal of this task is to rank all the generated translations according to their translation quality by considering numerous qualitative measures over the translation outputs. Mathematically, we describe a ranking system in (1) using a three-tuple as
R_system = {S, T, R}   (1)

where S represents a given source sentence with the corresponding set of translations represented by the tuple T = {t_1, t_2, t_3, ..., t_n}, where t_k is the kth translation corresponding to the source sentence S, and n is the total count of generated translations.
Here, every individual translation in the translation set T is associated with an ordinal list of ranks (judgments) represented by the tuple R = {r_1, r_2, r_3, ..., r_m}, where r_k corresponds to the rank given to translation t_k compared to the other alternative translations in the set T. From this, it is clear that this qualitative ranking mechanism does not indicate any generic or absolute quality measure. Since translation ranking is done at the sentence level, the mechanism focuses on one sentence at a time, considering its alternative translations and finally deciding on their translation quality. Thus, an annotated rank carries meaning only for the sentence under consideration and its corresponding generated translations. In particular, every source sentence S_j is associated with a translation set T_j = (t_1^j, t_2^j, t_3^j, ..., t_n^j), where t_i^j is the ith translation of the jth sentence and n is the translation count, while each translation list is associated with a list containing the relative rankings R_j = (r_1^j, r_2^j, r_3^j, ..., r_m^j), where r_k^j corresponds to the rank of the kth translation of the jth source sentence. Finally, one important point to note is that if the quality of two translation outputs is similar, the case is called a tie, and the same rank is assigned to both translation candidates.
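The tie convention above can be sketched in code. This is a minimal illustration; the function name and the dense-ranking convention are our assumptions, not details from the article:

```python
def ranks_with_ties(scores):
    """Map per-translation quality scores (lower = better) to 1-based
    ranks; translations with equal scores share the same rank, as in
    the tie case described above."""
    distinct = sorted(set(scores))
    return [distinct.index(s) + 1 for s in scores]


# Example: the two middle translations tie and receive the same rank.
print(ranks_with_ties([0.2, 0.5, 0.5, 0.9]))  # -> [1, 2, 2, 3]
```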
NEED OF RANKING SYSTEM
There is a need for developing a ranking system because the translation industry requires more transparency regarding MT systems' strengths and weaknesses. Indeed, the professional community nowadays discusses the services provided by MT systems and their effects on commercial software. Some of these important needs are given as follows.
Ranking contributes to the definition of translation quality: Enhancing translation quality is essential for providers to avoid potential lawsuits, since a flaw in a translation can ultimately lead to a safety violation.1
Ranking gives a set of criteria and a rationale for evaluating translation systems: According to Church and Hovy, "it should be interpretable what an MT system can and cannot do,"2 particularly when translation services are used by the general public and applied at large scale.
Ranking responds to consumers' demands for easily accountable information about translation system quality: Ranking the accuracy (usefulness) of translation systems can help end-users decide which translation system suits their purpose.
RELATED WORK
The concept of translation ranking has been employed in various MT-related tasks. Initially, in the WMT07 evaluation task, the ranking concept was used for evaluating machine translation (MT).3 In this direction, some previous contributions based on ML approaches were proposed for training on rank data.4,5 However, these approaches used reference translations and were evaluated only by obtaining an overall corpus-level ranking.
In contrast, and closer to our reference-free approach, Rosti et al.6 performed translation selection using generalized models at the sentence level. In particular, they exploited the re-ranking of N-best lists combined from multiple translation systems to comparatively rank the translation outputs. Specia et al.7 developed one translation quality prediction model for every translation system; the scores predicted by these individual models were then used to rank the alternative candidate translations of the same source-language sentence.
Later, in this context, Soricut and Narsale8 employed ML to rank the multiple alternative translations and selected the highest-ranked translation output. Subsequently, Avramidis and Popovic9 ranked alternative translation outputs using logistic regression as pairwise classifiers with black-box features. Tezcan et al.10 trained a regression model using baseline features combined with word-level predictions as features to develop a sentence-level ranking model. Later, Chen et al.11 extracted neural features and cross-entropy features for training an SVR model to build the ranking model. Etchegoyhen et al.12 employed a minimalist approach to develop a sentence-level ranking model; the authors call the method minimalist because it requires few resources and minimal deployment effort.
APPROACH
The sentence-level ranking of alternative
translations has been addressed as a typical
supervised machine learning problem, as shown
in Figure 1. The steps involved in implementing
this approach are described as: first, alternative
candidate translations corresponding to a single
source sentence from a given input data corpus
are generated at a time by inputting it to multiple
MT systems. Second, we develop a feature
extraction module that extracts a feature vector
by analyzing the source sentence, translation
outputs, and the translation process. Third, the
obtained feature vectors with manually anno-
tated quality ranks referred to as training instan-
ces are given to the ML algorithm for developing
the ranking model. Finally, this ranking model
predicts the ranks on an unseen test set.
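The four pipeline steps can be sketched with a toy one-feature linear regressor. The single length-ratio feature and the least-squares learner are stand-ins for illustration only (the article's system uses many more features and learners such as SVR), and all names here are our own:

```python
def extract_features(source, translation):
    """Step 2 (toy): a single surface feature, the target/source
    token-length ratio."""
    s, t = len(source.split()), len(translation.split())
    return t / max(s, 1)


def fit_linear(xs, ys):
    """Step 3 (toy): ordinary least-squares fit y = a*x + b over the
    training instances (assumes the xs are not all identical)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx


def predict(model, x):
    """Step 4: predict a continuous quality value for an unseen instance."""
    a, b = model
    return a * x + b
```

A richer system would replace `extract_features` with the full black-box and glass-box feature vector and `fit_linear` with one of the regression algorithms listed later in the article.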
DATASETS AND MT SYSTEM
We collected 5000 sentences from the tourism domain for the development of our proposed model on the English–Hindi language pair. The dataset is freely available on the "Technology Development for Indian Languages" website. Further, we split this dataset into training and test sets in the ratio of 80% for training and 20% for testing the ranking model.
Subsequently, we used three MT systems, namely Google, Bing, and the Moses phrase-based model,13 to
Figure 1. Ranking system architecture.
obtain the alternative candidate translations corresponding to a given source sentence. One example source sentence from the test set, for which the introduced translation systems generated alternative candidate translations alongside its reference translation, is given as follows.

Source Sentence: The government of India office has more information on other destinations as well.
QUALITY SCORE
The translated outputs corresponding to the given source text are annotated manually on a five-point scale using the human translation edit rate14 score to generate the training samples for the ranking model. The five-point scale used for annotating the quality of the translation outputs is described by the following scoring scheme:

1 – the translation is intelligible and perfect;
2 – the translation is generally intelligible and clear but requires minor correction;
3 – the translation needs significant editing effort to reach a publishable level;
4 – the translation contains various errors and mistranslations that require major correction;
5 – the translation is very poor.
FEATURE EXTRACTION
We used the feature extraction module to obtain a feature vector that indicates translation quality. This feature vector is represented mathematically in (2) and is extracted for every pair (S_i, T_i) of a source sentence and its translations, with i = 1, 2, 3, ..., n:

f(i) = G(S_i, T_i).   (2)

Here, G is a feature generation function that extracts the feature vector given a single source sentence and its corresponding translation outputs. Each feature vector f(i), obtained from the ith source sentence, together with its corresponding list of ranks r_i, defines a training instance as given in (3). Furthermore, a training set containing N instances is formulated as given in (4):

I(i) = (f_i, r_i)   (3)

T = {(f_i, r_i)}, i = 1, ..., N.   (4)

Finally, given a training set, the goal of the learning algorithm is to define a ranking function that minimizes the total error (5) between the actual and predicted rank lists, where the ranking function predicts a list of ranks r̂_i given a feature vector f(i):

sum_{i=1}^{m} error(r_i, r̂_i).   (5)
Mainly, in this article, we used two feature sets, namely black-box and glass-box, to extract features indicating translation output quality, using various linguistic analysis tools to analyze the source-language sentence, the alternative target translations, and aspects of the translation process. According to their origin, these feature sets are described as follows.
Black-Box Features
The black-box features are obtained by automatically analyzing both the source and target sentences. The black-box feature set is further categorized as follows.

Surface features: These features are simple and account for the difficulty of the translation task merely by analyzing the source and target sentences. They include the token (word) counts of both source and target sentences, the unknown-word count, the average character count per token, the sentence length, and the source-to-target length ratio.
Target LM-based scores: The LM is an indication of the fluency and plausibility of the target sentence, since it gives statistics about the correctness of word sequences in a specific language. This category mainly covers features such as the smoothed unigram, bigram, and trigram probabilities of the target-language sentence; the unigram, bigram, and trigram perplexities of the target sentences are also considered under this category.
IBM Model 1 scores: The IBM Model 115 score is based on a bag-of-words translation model, which measures the quality of association over all feasible alignment probabilities between the tokens of the source sentence and the target sentence. This category includes the scores in both directions.
Parsing-based features: This category uses features obtained from PCFG parsing16 of both the source and target sentences. These features cover more complex phenomena such as long-distance structures and grammatical fluency. PCFG parsing produces numerous possible parse trees for a given input sentence, resulting in an n-best list of parse candidates. These features include the count of n-best trees generated, the log-likelihood of the parse trees, and the confidence of the best parse tree.
Shallow grammatical match counts: To obtain adequacy features, similar or identical grammatical structures must occur in both the source sentence and the target translations. This category covers the occurrences of the basic node labels of the PCFG parse tree in both source and target sentences; in particular, it includes nouns, verbs, NPs, VPs, PPs, and subordinate clauses.
Source complexity-based features: These include features such as the average source-sentence token length, the average count of translations per token in the source sentence, the percentage of the source sentence's 1-grams, 2-grams, 3-grams, and 4-grams that fall in the lower and higher frequency quartiles of a corpus, and the percentage of the source sentence's 1-grams, 2-grams, 3-grams, and 4-grams present in a given corpus.
Contrastive scoring: Each target translation is scored with automatic evaluation metrics (such as METEOR17) as a feature, using the alternative translations as reference translations.18
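A couple of the black-box features above can be sketched in a few lines. This is a toy illustration: the `vocab` known-words set and the add-alpha bigram model are our own simplifications of how unknown words and the target LM scores might be computed:

```python
import math
from collections import Counter


def surface_features(source, target, vocab):
    """A few of the surface features listed above; `vocab` is an assumed
    known-words set used for the unknown-word count."""
    s_toks, t_toks = source.split(), target.split()
    return {
        "src_len": len(s_toks),
        "tgt_len": len(t_toks),
        "len_ratio": len(s_toks) / max(len(t_toks), 1),
        "unknown": sum(1 for w in s_toks if w.lower() not in vocab),
        "avg_chars": sum(map(len, t_toks)) / max(len(t_toks), 1),
    }


def bigram_logprob(sentence, corpus, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a target sentence
    against a small training corpus; a toy stand-in for the smoothed
    target LM scores described above."""
    tokens = [t for line in corpus for t in line.split()]
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab_size = len(unigrams)
    logp = 0.0
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        logp += math.log((bigrams[(a, b)] + alpha)
                         / (unigrams[a] + alpha * vocab_size))
    return logp
```

A fluent target sentence whose word sequences appear often in the LM corpus receives a log-probability closer to zero than a disfluent one.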
Glass-Box Features
This category of features generally relies on the internal workings of the MT system and describes the processes involved in generating the translation. They are also referred to as MT system features. These features are as follows.
Count of n-best translations corresponding to each source-language sentence.
Word posterior probability.
Costs obtained from Moses, such as the language model cost, distortion cost, weighted token penalty cost, and unweighted token penalty cost.
MT system output back-translation: each back-translated sentence is scored using BLEU, treating the source sentence as the translation reference.
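The back-translation feature above can be sketched with a toy unigram-precision stand-in for BLEU (a real system would use full BLEU with higher-order n-grams and a brevity penalty); the function name is our own:

```python
from collections import Counter


def unigram_precision(back_translation, source):
    """Score a back-translated sentence against the original source,
    treating the source as the reference: the fraction of hypothesis
    tokens matched (with clipping) in the reference."""
    ref = Counter(source.lower().split())
    hyp = back_translation.lower().split()
    if not hyp:
        return 0.0
    matched = 0
    for w in hyp:
        if ref[w] > 0:      # clip: each reference token matches once
            ref[w] -= 1
            matched += 1
    return matched / len(hyp)


# A back-translation that round-trips faithfully scores near 1.0.
print(unigram_precision("the cat sat", "the cat sat"))  # -> 1.0
```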
LEARNING METHODS
To develop the ranking model for the sentence-level translation quality ranking task, we rely on several powerful supervised machine-learning regression algorithms. In other words, treating ranking as a regression task, we build models that assign a continuous value as the quality measure of a sentence. Mainly, we aim to develop ranking models that automatically predict the rank (a continuous value) in a range, say [1:5], in a similar way as humans do manually. In this context, some effective regression algorithms that have been used in the past and proved successful for other researchers are given as follows:
Partial least squares regression;
Linear regression;
Lasso;
SVR; and
M5P.
EVALUATION MEASURES
The developed ranking model is evaluated by computing the prediction errors between the actual and predicted values, and the correlations between the model-predicted ranks and the manually annotated ranks. In particular, for prediction error evaluation, we computed the mean absolute error (MAE) and the root mean squared error (RMSE), where, for a given instance i, y_i denotes the actual target value and ŷ_i the value estimated by the ranking model. These evaluation measures, used to evaluate the performance of the learning models, are briefly described below.
MAE: The MAE (6) measures the average magnitude of the errors in a set of predictions:

MAE = (1/n) sum_{i=1}^{n} |ŷ_i - y_i|.   (6)
RMSE: The RMSE (7) is a quadratic scoring rule used to measure the average magnitude of the error:

RMSE = sqrt( (1/n) sum_{i=1}^{n} (ŷ_i - y_i)^2 ).   (7)
Kendall tau rank correlation: This ranking evaluation measure (8), with n representing the size of each sample, C the number of concordant pairs (pairs whose rankings agree), and D the number of discordant pairs (pairs whose rankings disagree), while pairs with equal rankings count as neither concordant nor discordant, is defined as

tau = (C - D) / ((1/2) n (n - 1)).   (8)
Spearman's rank correlation: Spearman's rank correlation (9), with N representing the count of alternative translations and d_i the difference between the ranks assigned to a translated sentence by the two rankings, is given by

rho = 1 - (6 sum_{i=1}^{N} d_i^2) / (N (N^2 - 1)).   (9)
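The four measures above can be written as a short pure-Python sketch; the correlation formulas below assume tie-free rank lists, as in the simple forms of (8) and (9):

```python
import math


def mae(actual, predicted):
    """Mean absolute error, as in Eq. (6)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, as in Eq. (7)."""
    return math.sqrt(sum((p - a) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))


def kendall_tau(r1, r2):
    """Kendall's tau, as in Eq. (8): concordant minus discordant pairs
    over all n*(n-1)/2 pairs; tied pairs count as neither."""
    n = len(r1)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (r1[i] - r1[j]) * (r2[i] - r2[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)


def spearman_rho(r1, r2):
    """Spearman's rank correlation, as in Eq. (9), for tie-free ranks."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical rank lists give tau = rho = 1; fully reversed lists give -1.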
RESULTS
In this section, we present the experimental results of our proposed ranking model. In particular, Table 1 shows the scores obtained by our developed ranking model using the SVR regression algorithm. From these results, we see that the Google translator obtains the best quality scores, whereas the Moses phrase-based translator obtains the worst; the former is therefore ranked first (best) and the latter third (worst).
CONCLUSION
This article addressed the challenging task of automatically ranking translation outputs to predict translation quality. The problem is addressed as a supervised ML task using a regression algorithm built on several features. The correlations with the manual judgments (rankings) show success in developing a mechanism for ranking translations by their quality. Finally, the performance of the followed mechanism is significant and remarkably high, without any access to gold reference translations.
&REFERENCES
1. E. Westfall, “Legal implications of MT on-line,” in Proc. 2nd AMTA Conf., 1996, pp. 231–232.
2. K. W. Church and E. H. Hovy, “Good applications for
crummy machine translation,” Mach. Transl., vol. 8,
no. 4, pp. 239–258, 1993.
3. C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz, and
J. Schroeder, “(Meta-) evaluation of machine
translation,” in Proc. 2nd Workshop Statist. Mach.
Transl., 2007, pp. 136–158.
4. Y. Ye, M. Zhou, and C. Y. Lin, “Sentence level machine
translation evaluation as a ranking problem: one step
aside from BLEU,” in Proc. 2nd Workshop Statist.
Mach. Transl., 2007, pp. 240–247.
5. K. Duh, “Ranking vs. regression in machine translation evaluation,” in Proc. 3rd Workshop Statist. Mach. Transl., 2008, pp. 191–194.
6. A. V. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr, “Combining outputs from multiple machine translation systems,” in Proc. Main Conf. Human Lang. Technol., Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2007, pp. 228–235.
Table 1. Performance evaluation of the developed ranking model.

Translation systems    MAE     RMSE    Spearman's correlation    Kendall's Tau    Rank
Google                 0.0698  0.1140  0.7881                    0.6919           1
Bing                   0.0759  0.1263  0.7554                    0.6617           2
Moses-Phrase based     0.1507  0.2095  0.4255                    0.3088           3
7. L. Specia, D. Raj, and M. Turchi, “Machine translation
evaluation versus quality estimation,” Mach. Transl.,
vol. 24, no. 1, pp. 39–50, 2010.
8. R. Soricut and S. Narsale, “Combining quality
prediction and system selection for improved
automatic translation output,” in Proc. 7th Workshop
Statist. Mach. Transl., 2012, pp. 163–170.
9. E. Avramidis and M. Popovic, “Machine learning
methods for comparative and time-oriented quality
estimation of machine translation output,” in
Proc. 8th Workshop Statist. Mach. Transl., 2013,
pp. 329–336.
10. A. Tezcan, V. Hoste, B. Desmet, and L. Macken,
“UGENT-LT3 SCATE system for machine translation
quality estimation,” in Proc. 10th Workshop Statist.
Mach. Transl., 2015, pp. 353–360.
11. Z. Chen et al., “Improving machine translation quality
estimation with neural network features,” in Proc. 2nd
Conf. Mach. Transl., 2017, pp. 551–555.
12. T. Etchegoyhen, E. M. Garcia, and A. Azpeitia,
“Supervised and unsupervised minimalist quality
estimators: Vicomtech’s participation in the WMT 2018
quality estimation task,” in Proc. 3rd Conf. Mach.
Transl., Shared Task Papers, 2018, pp. 782–787.
13. P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-
based translation,” in Proc. Conf. North Amer. Chapter
Assoc. Comput. Linguistics Human Lang. Technol.,
2003, vol. 1, pp. 48–54.
14. M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and
J. Makhoul, “A study of translation edit rate with
targeted human annotation,” in Proc. 7th Conf. Assoc.
Mach. Transl. Amer., 2006, vol. 200, no. 6,
pp. 223–231.
15. P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and
R. L. Mercer, “The mathematics of statistical machine
translation: Parameter estimation,” Comput.
Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
16. S. Petrov, L. Barrett, R. Thibaux, and D. Klein,
“Learning accurate, compact, and interpretable tree
annotation,” in Proc. 21st Int. Conf. Comput.
Linguistics 44th Annu. Meeting Assoc. Comput.
Linguistics, 2006, pp. 433–440.
17. S. Banerjee and A. Lavie, “METEOR: An automatic
metric for MT evaluation with improved correlation with
human judgments,” in Proc. ACL Workshop Intrinsic
Extrinsic Eval. Measures Mach. Transl. Summarization,
2005, pp. 65–72.
18. R. Soricut, N. Bach, and Z. Wang, “The SDL language
weaver systems in the WMT12 quality estimation
shared task,” in Proc. 7th Workshop Statist. Mach.
Transl., 2012, pp. 145–151.
Nivedita Bharti is currently a Full-Time Research Scholar with Banasthali Vidyapith, Vanasthali, India. Her research interests include the development of models and methods for quality estimation of MT systems for Indian languages, as well as natural language processing, machine translation, machine learning, and deep learning. She received the M.Tech. degree in computer science. Contact her at nivedita2bharti@gmail.com.
Nisheeth Joshi is currently an Associate Profes-
sor with the Department of Computer Science,
Banasthali Vidyapith, Vanasthali, India. He primarily
works in the area of machine translation, information
retrieval, and cognitive computing. He has more than
12 years of teaching experience. He received the
Ph.D. degree in computer science and engineering
with specialization in evaluation of machine transla-
tion. He is a Life Member of the Computer Society
of India and the Institution of Electronics and
Telecommunications Engineers, India. Contact him
at jnisheeth@banasthali.in.
Iti Mathur is currently an Associate Professor with
the Department of Computer Science, Banasthali
Vidyapith, Vanasthali, India. She primarily works in
the field of information retrieval, ontology engineer-
ing, and machine translation. She has more than 15
years of experience in teaching and research. She
received the Ph.D. degree in computer science with
specialization in the area of ontologies. She is a Life
Member of the Computer Society of India. Contact
her at mathur_iti@rediffmail.com.
Pragya Katyayan is currently a Full-Time
Research Scholar with Banasthali Vidyapith, Vanas-
thali, India. Her research interest lies in the area of
machine translation, natural language processing,
information retrieval, and deep learning. Before join-
ing the Ph.D. programme, she received the Master of
Science degree in computer science and has
worked as a consultant on various projects based on
natural language processing. She is a Student Mem-
ber of the Computer Society of India. Contact her at
pragya.katyayan@outlook.com.