Old Needs, New Solutions: Comparable Corpora for Language Professionals

Abstract
Use of corpora by language service providers and language professionals remains limited, due to the existence of competing resources that are likely to be perceived as less demanding in terms of the time and effort required to obtain and (learn to) use them (e.g. translation memory software, term bases and so forth). These resources, however, have limitations that could be compensated for through the integration of comparable corpora and corpus building tools into the translator's toolkit. This chapter provides an overview of the ways in which different types of comparable corpora can be used in translation teaching and practice. First, two traditional corpus typologies are presented, namely small and specialized "handmade" corpora collected by end-users themselves for a specific task, and large and general "manufactured" corpora collected by expert teams and made available to end users. We suggest that striking a middle ground between these two opposites is vital for professional uptake. To this end, we show how the BootCaT toolkit can be used to construct largish and relatively specialized comparable corpora for a specific translation task, and how, by varying the search parameters in very simple ways, the size and usability of the corpora thus constructed can be further increased. The process is exemplified with reference to a simulated task (the translation of a patient information leaflet from English into Italian) and its efficacy is evaluated through an end-user questionnaire.
Silvia Bernardini¹ and Adriano Ferraresi²

¹ SITLeC Department (University of Bologna), C.so Diaz 64, 47100 Forlì (FC), Italy. Tel.: +39 0543 374736. silvia.bernardini@unibo.it
² Department of Theories and Methods of Human and Social Sciences (University of Naples "Federico II"), V. Rodinò 22, 80138 Naples (NA), Italy.
1 Introduction
Language professionals make ample use of technology. A recent survey estimated that EU language service providers invest 5–10% of their annual turnover in multilingual technology tools, with a likely market size topping one billion euros in 2015 [30]. These tools include electronic dictionaries, terminology extraction tools and translation memory (TM) tools, as well as language training software. While corpora and corpus processing software are only mentioned in passing, and not focused upon explicitly, they would clearly belong here, since they can be, and indeed have been, used to provide (or add to) dictionary-like insights about word meaning and use [22], to extract terminology [27] [7], to produce contextually appropriate target texts through concordance browsing [36] [8] and to improve competence in a foreign language [19]. There is no doubting that many academics who teach translation and foreign languages believe in the merits of corpus work, particularly of the comparable kind.
Freelance translators also appear to be aware that they need corpora – though they do not call them that. A survey conducted in 2005–2006 among European (mainly British) translators and students of translation [24] showed that over half of the respondents (out of a total of 1,015) collected reference texts, but in the majority of cases they either read them or searched them using word processing search facilities. Very few were aware of corpus query tools, but very many (around 80%) claimed they would be interested in a service providing specialized corpora and/or extracting terms from them.
Lastly, some commercial TM software vendors are beginning to see the point of (slightly) more sophisticated TM concordancing functionalities, which would bridge the gap between TMs and aligned parallel corpora. The Canadian company MultiCorpora offers concordancing in full-text context as a selling point for its software, overcoming a common problem with competitor tools, where "the concordance passage is limited to isolated sentences that exist in the TM database [so] there is often insufficient context to provide guidance on the applicability of the found result" [25]. The latest release from market leader SDL Trados (Studio 2009) incorporates "character-based" concordancing to retrieve related words or "fuzzy matches" (a little like a regular expression search would), and allows searches on the target text as well as the source text. Inefficient TM concordancing is also perceived as a problem by end users, judging from complaints in user forums.
And yet use of corpora by language service providers and language professionals remains limited. The lack of widespread uptake is probably due to the existence of competing resources that are likely to be perceived as less demanding in terms of the time and effort required to obtain and (learn to) use them. The translator's toolkit has never been so replete with tools and resources, from termbases and dictionaries to the Web itself, from TMs to aligned or unaligned parallel or "bi-" texts. Each of these has a role to play in the translation process, and indeed expert translators resort to them depending on task requirements, the ability to choose quickly and confidently among different resources being one of the hallmarks of translation competence [14]. And they all have advantages and disadvantages.
Dictionaries and termbases, particularly electronic ones, provide a wealth of "digested" information (equivalents, definitions, synonyms, typical examples) sanctioned by lexicographers and terminographers; searches are quick and easy, and solutions are often found, though they may lack the reassuring added value of a "contextual match" [35]. One obvious way of overcoming this drawback is to consult (tens or hundreds of) actual texts on the web, and draw inferences based on actual use in context. This process is more time-consuming, effortful and complex than dictionary lookup, though, requiring the opening and closing of multiple pages, the quick evaluation of the reliability of sources, and an acceptance of the many limits of search engines that were not designed for linguists [3]. TMs are undoubtedly valuable productivity-enhancing resources, providing authoritative equivalents in context from previously translated texts. And yet TMs are hardly the translator's panacea. First, they are only really useful for certain text types (repetitive texts with shortish sentences) and tasks (revisions, updates). Secondly, it has been suggested that they might affect translators' strategies in a negative way, by making recyclability a priority rather than a positive side effect [6]. By preserving sentence boundaries and avoiding variation, translators may increase the amount of leveraging from one task to the next, but not necessarily produce the "best possible" translation. Lastly, aligned parallel texts/corpora are not available ready-made for most specialised domains³ and their set-up is time-consuming and technically demanding for translators.
In the next section (2) we will suggest that some of these limitations can be overcome through the integration of comparable corpora and corpus building tools in the translator's toolkit. We shall start by providing an overview of the ways in which different types of comparable corpora can be used in translation teaching and practice.⁴ We shall first look at two "traditional" corpus typologies widely discussed in the literature: small and specialized "handmade" corpora collected by end-users themselves for a specific task, and large and general "manufactured" corpora collected by expert teams and made available to end users. These occupy opposite poles of a cline going from very small and fine-tuned corpora to very large and general ones. Since, besides their advantages, both have disadvantages that are likely to make translators shun them, we shall suggest that striking a middle ground between these two opposites is vital. To this end, in Section 3 we shall see how the BootCaT toolkit can be used to construct largish and relatively specialized comparable corpora for a specific translation task, and how, by varying the search parameters in very simple ways, the size and usability of the corpora thus constructed can be further increased. Usability is evaluated by respondents and exemplified with reference to a simulated task (the translation of a patient information leaflet from English into Italian). In Section 4 we conclude by summarizing our argument and looking at future prospects.

³ This may change in the future, as more tools like Linguee (http://www.linguee.com/) provide access to the aligned Web, and possibly to subsections of it.
2 Comparable corpora as alternatives?
2.1 The handmade solution
Compared to other resources, comparable corpora do offer a number of advantages. Unlike in the case of web search engines and TMs, consultation of corpora through dedicated software allows translators to benefit from the querying and displaying facilities which are at the very heart of corpus-related methodologies. Linguistically sophisticated queries, e.g. through the use of wildcards and regular expressions, together with the possibility to sort results according to co-text, undoubtedly give corpora an edge in terms of the time and effort required to make patterns in context emerge. Once a corpus has been compiled, translators no longer need to browse through different pages, as they would when using search engines; they can specify rules to target specific linguistic forms (as well as other querying options, e.g. setting the desired number of "empty" words in an expression, etc.), and can variably order results according to the pattern they are searching for: no commercial TM tool that we are aware of implements all these features. Lastly, most concordancers allow full-text browsing, thus offering the rich contextual information required for decision-making that TM tools seem to have so much trouble providing [9].⁵
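To make the query and display facilities just described more concrete, the following minimal sketch (ours, not part of the original chapter) shows how a regex-based keyword-in-context search with co-text sorting might look in Python; the corpus file name and the query are illustrative assumptions.

    import re

    def concordance(text, pattern, window=40):
        """Return keyword-in-context hits for a regex, sorted by right-hand co-text."""
        hits = []
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            left = text[max(0, m.start() - window):m.start()].replace("\n", " ")
            right = text[m.end():m.end() + window].replace("\n", " ")
            hits.append((left, m.group(), right))
        # Sorting on the right co-text makes recurring patterns line up visually
        return sorted(hits, key=lambda h: h[2].lower())

    # Hypothetical usage: match "side effect(s)", with or without a hyphen
    with open("english_pils.txt", encoding="utf-8") as f:  # assumed corpus file
        corpus = f.read()
    for left, kw, right in concordance(corpus, r"\bside[- ]effects?\b")[:20]:
        print(f"{left:>40} | {kw} | {right}")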
⁴ As suggested in the previous section, in this chapter we are not specifically discussing aligned parallel corpora. In our view these are more akin to TMs than to comparable corpora in terms of the technical issues involved in their construction and consultation, and of the type of insights translators can obtain from them; they are therefore not directly relevant here.

⁵ More "advanced" corpus querying techniques, like the extraction of keywords or the computation of collocational scores, can of course be of great interest to translators. However, their relevance and usefulness may be hard to grasp for less corpus-savvy users, and hence they are not discussed here.
Other advantages that comparable corpora have been suggested to offer for translation purposes depend on the "type" of corpus considered. Two kinds of corpora are particularly relevant for the present discussion, namely those that position themselves at the extremes of the "size cline" which is invoked in the literature as a yardstick to categorize corpora [18]. While size is a fuzzy criterion, especially as technological advances in computer power and memory constantly push the boundaries forward, the distinction between small and large corpora also reflects a difference in the "textual populations" they attempt to "sample" [1]. Small corpora are usually collections of "specialized" texts, intended to represent a specific domain, and include texts with homogeneous content (e.g. medicine), text type or genre (e.g. textbook), or both. On the other hand, large corpora usually aim to represent a much wider population, i.e. the whole of a language or language variety (e.g. British English), and for this reason are also called "reference" or "general purpose" corpora.
In essence, small corpora are not dissimilar from the domain-specific collections of texts translators use as reference materials (cf. Section 1), the main difference being in the way they are consulted, i.e. with corpus processing software instead of word processing search facilities. The advantages associated with the use of small corpora have been discussed in the literature, where such corpora are also called, crucially for the purposes of the present paper, comparable, ad hoc corpora (e.g. [28], [35]). Comparability here refers to the similarity of the (target language) texts being collected, ideally both in terms of topic and text type/genre, to the source text being translated, and/or to "equivalent" source language texts; the ad hoc label puts the stress on yet another aspect usually associated with these corpora, i.e. the fact that they are typically built manually for specific translation tasks. According to Aston [1], specialized, ad hoc comparable corpora: a) may be perceived as more familiar by translators compared to other corpus resources, especially since texts are consulted and selected manually for inclusion in the corpus; b) facilitate the process of interpreting concordance data, since the likelihood of encountering irrelevant examples (e.g. polysemous words used in many different senses) is reduced; c) provide assistance in producing natural-sounding translation hypotheses based on bottom-up search strategies, which reduce the risks connected with the "hypothesis formulation – validation" cycle, a cycle which may result in overlooking potential translation alternatives.
This does not mean that small, specialized corpora do not have their shortcomings: while they are powerful pedagogical instruments, whose effectiveness in the translation classroom as "performance-enhancing" tools has been convincingly argued for [35], they may not be the best option for professionals. Given their small size, they may not contain enough matches to support confident generalizations about certain usages in the specialized domain in question, especially in the case of rarer patterns above the word unit. Indeed, Varantola [35] summarizes the advantages connected with small corpora as reassurance, meaning that "[w]hen relevant corpus information [is] available, the users often [gain] reassurance for their strategic decisions as well as the actual lexical choices". This definition highlights precisely the potentially negative aspect of small corpora, i.e. that they simply may not provide useful evidence for the translation problem at hand and "will rarely document every word in an ST or TT" [1]. Hence the tradeoff between effort (to gather a sufficient number of texts, learn to use concordancing software, etc.) and effectiveness may be perceived by translators as not favourable enough to justify the investment of their time.
2.2 The manufactured solution
Large, general corpora can enable translators to overcome this limitation. Munday [26] and Hoey [23] provide evidence of contrastive insights about word usage in different languages that can be gleaned from comparable reference corpora and that are crucial for decision-making in translation. Philip [29] offers a general description of this process, claiming that "[c]hoice in translation is related to choice in the SL, and this can be identified by comparing a given expression against its possible alternatives along the paradigmatic axis. [...] [I]f an equivalent paradigm of choice is set up for the TL, the most suitable correspondences can be identified and used in the translated text". Unlike small corpora, general-purpose ones are also much larger and more diverse in terms of the texts sampled, making it possible to document a large number of phraseological patterns and to relate them to specific registers. As Aston [1] puts it, these corpora "can make [translators] more sensitive to issues of phraseology, register and frequency, which are poorly documented by other tools", and hence can be used "as complements to traditional dictionaries and grammars". Add to this that the process of compiling large reference corpora has been "democratized" by the growth of the web, which has made it possible for researchers worldwide to construct and make available to the research community very large corpora in a variety of languages.
Yet reference corpora present at least three major limitations for translators. First, the specialized domain under investigation may be underrepresented (or totally absent) in the selected corpus, thus making the process of finding patterns in context matching those in the source or target language text a fruitless exercise. Second, and conversely, "appropriate instances" [1] may be difficult to identify in the corpus due to an embarrassment of riches: too many solutions for a query, from widely different text types. Translators are forced to sift through very many and/or potentially irrelevant results, e.g. in the case of polysemous words. And third, these corpora tend to come with their own search interface, often complex and not necessarily intuitive, such that getting acquainted with it may be a rather daunting task for a time-pressed and not especially motivated professional or student. Reference corpora of the web-as-corpus type [3] have the further disadvantage of carrying no guarantee of quality or representativeness, being automatically constructed. While we have argued elsewhere [16] that this disadvantage is partly compensated for by their size and up-to-dateness, it remains a weak spot from the perspective of someone in search of a quick, reliable answer.
2.3 Using handmade and manufactured corpora
In this section we briefly exemplify the use of "traditional" specialized and reference corpora for a simulated practical task: the translation from English into Italian of a Patient Information Leaflet (PIL) for paracetamol tablets. This text type was selected because it is likely to be familiar to most readers while being sufficiently specialized (in terms of subject domain) and conventional (in terms of genre) to give an idea of how different corpora might help in its translation, or fail to do so. Pedagogically, it is an ideal task, since it requires students to observe, and hence become sensitized to, the existence of cross-linguistic phraseological patterning in texts. Indeed, the task was also performed in class with a group of BA-level students of translation. A small specialized comparable corpus was constructed manually to help them observe corresponding patterns in the two languages. Nine leaflets in English and nine in Italian were collected, all accompanying paracetamol medicines, making sure that they were reliable specimens of the genre. The whole process of searching the web through appropriate queries, evaluating texts, discarding dubious ones and saving good ones, took about 5 hours, and provided us with a corpus of approximately 25,000 words (13,518 in English and 11,832 in Italian).
Even such a small corpus can be useful for translating a highly conventional text like this one. Taking a very straightforward example, the ST lists "active ingredients" and "other ingredients". By searching for the Italian name of one ingredient from each category in the Italian subcorpus, say "paracetamolo" ("paracetamol") and "sodio" ("sodium"), and browsing the left-hand co-text, one easily finds equivalents for "active ingredients" (i.e. "principi attivi") and "other ingredients" ("eccipienti"). The latter is especially tricky, since an inexperienced translator may be misled into trying to come up with a phrase equivalent rather than a single-word equivalent, and the correct translation may simply not come to mind, especially when pressed for time. Since co-textual searches like this one (using known equivalents in the target language as "anchors" to find equivalents for unknown words used in their vicinity) take some time and effort, a small, carefully constructed corpus may be more practical than a larger, dirtier one, provided of course that it contains some evidence about the expression in question.
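To illustrate the mechanics of such an anchor-based search, here is a minimal sketch (ours, not part of the original chapter) that counts what appears immediately to the left of a known equivalent; the corpus file name is an illustrative assumption.

    import re
    from collections import Counter

    def left_cotexts(corpus, anchor, span=3):
        """Count the word sequences found within span tokens to the left of an anchor word."""
        tokens = re.findall(r"\w+", corpus.lower())
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == anchor:
                counts[" ".join(tokens[max(0, i - span):i])] += 1
        return counts

    # Using the known equivalent "paracetamolo" as an anchor, inspect what precedes it
    with open("italian_pils.txt", encoding="utf-8") as f:  # hypothetical corpus file
        italian = f.read()
    for cotext, n in left_cotexts(italian, "paracetamolo").most_common(10):
        print(n, cotext, "... paracetamolo")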
Large reference corpora also have their uses for this task, though different from the kind just discussed: co-textual searches are certainly impractical, as are searches involving genre conventions (of the kind: how do they say "possible side-effects" in Italian PILs?). Instead, it would make sense to look up a phrase like "serious heart condition" in a reference corpus of English, to find out whether the adjective modifies the noun phrase or is an integral part of it, and then look up a comparable reference corpus of Italian to find possible equivalents. We would thus find that "serious" is the most frequent adjectival premodifier of "heart condition" in ukWaC (32 occurrences), and that among the top 20 collocates in this position there are no obvious synonyms (if we exclude "life-threatening", 6 occurrences) and only two antonyms, i.e. "minor" and "mild", which taken together occur 5 times only. This would appear to suggest that the adjective is part of the phrase, or at least that it forms a restricted collocation with it [11]. In Italian, the unmarked position of the adjective in this case would be following (rather than preceding) the noun phrase, since the pre-modifying position implies a degree of subjective judgement [32]. A search for "cardiopatia" (one of the terms for "heart condition" in Italian) followed by an adjective, conducted on a comparable reference corpus of Italian (itWaC), shows that the only viable equivalent for "serious" in this context is "grave" (5 occurrences, the most frequent adjectival postmodifier with a non-technical meaning).⁶

⁶ See http://wacky.sslmit.unibo.it/doku.php for information about ukWaC and itWaC.
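A crude version of such a collocate count can be sketched as follows (our illustration, not the chapter's; real reference corpora like itWaC are queried through dedicated corpus tools, and a proper version would filter collocates by part of speech). The file name is a hypothetical plain-text sample.

    import re
    from collections import Counter

    def right_collocates(corpus, node, span=1):
        """Count words occurring within span tokens after a node word."""
        tokens = re.findall(r"\w+", corpus.lower())
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == node:
                counts.update(tokens[i + 1:i + 1 + span])
        return counts

    with open("itwac_sample.txt", encoding="utf-8") as f:  # hypothetical sample file
        itwac = f.read()
    # Candidate postmodifiers of "cardiopatia"; adjectives still need manual vetting
    print(right_collocates(itwac, "cardiopatia").most_common(20))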
3 The BootCaT way
As argued and exemplified in Section 2, both small ad hoc corpora and large reference corpora can be of use in the translation process – to answer questions related to genre conventions and terminology respectively. However, there are cases in which the former provide too little or no evidence, and the latter too much. For instance, in the ST we find the following sentence: "The product is available in cartons of 8 or 16 capsules". Possible translations for "carton" given by an English-Italian dictionary (Oxford Paravia) and potentially acceptable in this context are "scatola" and "confezione". If we look up the word "capsule", the obvious Italian equivalent of "capsules", in the Italian subcorpus of our ad hoc corpus, hoping to find either "scatola" or "confezione" in the left-hand co-texts, unfortunately we get no results, since none of the paracetamol products described in the sampled texts come in capsules. Making a similar query to a large reference corpus would be impractical, because of the enormous amount of data one would have to sift through to find contextually appropriate evidence: itWaC has 1,902 occurrences of "capsule", coming from the most varied types of texts.
A possible solution in this case would be to combine the advantages of manual and automatic methods of corpus building. This is the idea behind BootCaT [2], a software tool that (partly) automates the process of finding reference texts on the web and collating them into a single corpus. BootCaT is a multi-stage pipeline requiring user input and allowing varying levels of control. In the first step, users provide a list of single- or multi-word terms to be used as seeds for text collection. These are then combined into "tuples" of varying length and sent as queries to a search engine, which returns a list of potentially relevant URLs.⁷ At this point the user has the option of inspecting the URLs and trimming them; the actual web pages are then retrieved, converted to plain text and saved as a single file. This can then be interrogated using one of the standalone concordancers available (e.g. AntConc).⁸
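The following minimal sketch (ours, under stated assumptions) illustrates the pipeline logic just described: random tuple formation, query submission, and page retrieval. The search function is a placeholder for whatever search-engine API is available (cf. footnote 7), and the HTML-to-text step is deliberately naive.

    import itertools
    import random
    import re
    import requests  # assumed available; any HTTP client would do

    def build_tuples(seeds, tuple_len=3, n_tuples=10, rng_seed=0):
        """Randomly combine seed terms into query tuples, BootCaT-style."""
        random.seed(rng_seed)
        combos = list(itertools.combinations(seeds, tuple_len))
        return random.sample(combos, min(n_tuples, len(combos)))

    def collect_urls(tuples, search, per_query=10):
        """search(query) is a stand-in returning a list of URLs for a query string."""
        urls = []
        for t in tuples:
            query = " ".join(f'"{s}"' for s in t)  # quote each seed as an exact phrase
            urls.extend(search(query)[:per_query])
        return list(dict.fromkeys(urls))  # deduplicate while preserving order

    def fetch_corpus(urls, outfile="corpus.txt"):
        """Download pages, strip markup naively, and save everything as one file."""
        with open(outfile, "w", encoding="utf-8") as out:
            for url in urls:
                html = requests.get(url, timeout=10).text
                out.write(re.sub(r"<[^>]+>", " ", html) + "\n")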
⁷ As of August 2011 Yahoo! discontinued the API service used by BootCaT for URL retrieval. At the time of revising this chapter, work is under way to port the system to the Bing search engine.

⁸ http://www.antlab.sci.waseda.ac.jp/software.html
Using BootCaT one can build a relatively large quick-and-dirty comparable corpus (typically of about 80 texts in each language, with default parameters and no manual quality checks) in less than half an hour. The end product may be of variable quality, though, and since quality comes at a cost, a "clean" corpus would require much more time and effort spent on the corpus building task – e.g. browsing/selecting URLs, trying a few re-runs with different sets of keywords or tuples, manually browsing the collected corpus to discard low-quality or irrelevant texts, etc. This flexible approach to the task makes BootCaT a very useful tool for translators and translation students, one which has been used in the translation and terminology classroom to build small DIY corpora of varying size and specialization [8] [15] [17], and whose potential is worth exploring further.
3.1 Beyond topic: BootCaT for genre-restricted corpora?
In the "traditional" BootCaT pipeline, the first step in corpus creation consists in selecting seeds "that are expected to be representative of the domain under investigation" [2]. In the case of a translation task, the web pages that are retrieved by the tool based on these seeds are usually expected to deal with the same topic as the ST to be translated. However, topic similarity is not the only criterion of text selection one might adopt for documentation purposes: depending on the task, it can be argued that genre is equally if not more crucial.⁹

⁹ In this paper we define genre (loosely based on Swales [34]) as a recognizable set of communicative events with a shared purpose and common formal features.
As suggested by Crowston and Kwasnik [13], "[b]ecause most genres are characterized by both form and purpose, identifying the genre of a document provides information as to the document's purpose and its fit to the user's situation". This applies equally to information retrieval in general and to text collection for reference purposes in particular: the importance of genre comparability has indeed been repeatedly stressed in the literature on specialized comparable corpora for translation, as discussed in Section 2. Retrieving web pages belonging to a specific genre automatically is, however, far more complex than retrieving pages on a specific topic.
From the perspective of the translator in need of a rough-and-ready DIY genre-restricted corpus, the techniques proposed within IR are far too computationally complex and/or require extensive linguistic modelling. Available genre classification schemes also seem inadequate, since they are unlikely to target the desired genre(s), or to be available for the desired language(s) [31]. Even systems designed specifically for creating multilingual comparable corpora [20] usually present the drawback of needing substantial tuning if they are to be applied to languages other than those for which they were originally created.
In Section 3.2 we propose a naive approach to constructing a "genre-driven" corpus using BootCaT, i.e. a method which is intended to favour the retrieval of pages belonging to the same genre, rather than the same topic, as the ST under analysis. Frequent multi-word sequences have been suggested to be valuable indicators of genre: Biber and Conrad, for example, use "lexical bundles" (i.e. frequent, uninterrupted 4-word sequences) to describe register variation between conversation and academic prose [5]; Gries and Mukherjee compare regional varieties of Asian English in the ICE corpus using word sequences of varying length [21]. Along similar lines, n-grams have proved to be reliable discriminating cues in computational approaches to genre classification (cf. [12] and references therein). The approach we propose consists in using, instead of topic-specific keywords, the n most frequent trigrams from the manual corpus as input to the BootCaT pipeline, regardless of whether they are intuitively salient, syntactically complete, or lexically rich.
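Extracting such trigram seeds is computationally trivial, as the following sketch shows (our illustration; the file name is hypothetical, and the removal of trigrams containing proper nouns, done manually in our procedure, is left out here).

    import re
    from collections import Counter

    def top_trigrams(text, n=50):
        """Return the n most frequent word trigrams, lowercased; numbers are excluded."""
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        return [" ".join(t) for t, _ in trigrams.most_common(n)]

    with open("english_pils.txt", encoding="utf-8") as f:  # hypothetical manual subcorpus
        seeds = top_trigrams(f.read())
    print(seeds[:10])  # frequent sequences such as "the side effects" (cf. Table 1)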
3.2 Corpus construction
As a starting point for constructing the BootCaT corpus we used the manually constructed bilingual comparable corpus discussed in Section 2.3 above. For the topic-driven corpus, keywords were obtained from both subcorpora using a subset of the Europarl corpus as a reference corpus in AntConc (a sketch of the underlying keyness computation is given after Table 1). The top 50 keywords were selected; proper nouns were removed and the remaining words were lowercased and used as seeds, without further manual trimming (43 words in English and 45 in Italian). For the genre-driven corpus no reference corpus was required. We took the 50 most frequent trigrams in the manual subcorpora, removed those containing proper nouns and numbers, and lowercased them. The final lists of seeds contain 41 English trigrams and 46 Italian ones. The seeds were then imported into BootCaT and tuples were formed.¹⁰ To partly compensate for the use of phrases for the genre-driven corpus, which results in a higher number of words in the queries, we used longer tuples for the topic-driven corpus queries (5 single words) than for the genre-driven corpus queries (3 trigrams). Table 1 shows the first three queries used in the construction of the English subcorpora.

¹⁰ We used the frontend developed by Eros Zanchetta [37], available at http://bootcat.sslmit.unibo.it/.
Table 1. Examples of tuples used for the construction of the English subcorpora (one query per line).

English-G
  "the side effects" "inform your doctor" "you need to"
  "if you take" "solution for infusion" "doctor or pharmacist"
  "and what it" "effects not listed" "to your doctor"

English-T
  mg mixture or ingredients your
  influenza pain symptoms please doctor
  bowl syringe capsules use kg
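For readers curious about the keyword step, the following sketch (ours; AntConc implements several keyness statistics, of which log-likelihood is one common choice) compares target and reference frequency lists. File names are hypothetical.

    import math
    import re
    from collections import Counter

    def keywords(target_text, reference_text, n=50):
        """Rank words by log-likelihood keyness of a target vs. a reference corpus."""
        tgt = Counter(re.findall(r"[^\W\d_]+", target_text.lower()))
        ref = Counter(re.findall(r"[^\W\d_]+", reference_text.lower()))
        t_total, r_total = sum(tgt.values()), sum(ref.values())
        scores = {}
        for w, a in tgt.items():
            b = ref.get(w, 0)
            # Expected frequencies under the null hypothesis of equal use
            e1 = t_total * (a + b) / (t_total + r_total)
            e2 = r_total * (a + b) / (t_total + r_total)
            ll = 2 * a * math.log(a / e1)
            if b:
                ll += 2 * b * math.log(b / e2)
            scores[w] = ll
        return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]

    with open("english_pils.txt", encoding="utf-8") as f:     # hypothetical target corpus
        target = f.read()
    with open("europarl_sample.txt", encoding="utf-8") as f:  # hypothetical reference corpus
        reference = f.read()
    print(keywords(target, reference)[:10])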
Default parameters were used (10 tuples, 10 URLs per query) and no manual filtering of results was performed. Size information about the resulting corpora is provided in Table 2. Notice that the Italian-G subcorpus is much smaller than the rest. Another round of querying/text retrieval would have increased corpus size and made it comparable to the other subcorpora; in this case, however, we favoured comparability in terms of corpus construction procedure over comparability of the final output.
Table 2. Size data about the BootCaT subcorpora.
subcorpus number of texts number of words
English-G 89 166,276
English-T 95 174,397
Italian-G 36 60,478
Italian-T 76 133,965
Interestingly, the topic- and genre-driven corpora display very little overlap, despite the fact that the seeds were obtained from the same text collections. The English subcorpora share as few as 4 web pages (out of 89 in English-G and 95 in English-T), while the Italian ones have only one URL in common (out of 36 in Italian-G and 76 in Italian-T), thus suggesting that different seed selection methods do yield different results. In the next section, these differences are explored further.
3.3 Evaluation and discussion
As a way of evaluating the output of the procedures employed to build the topic- and genre-driven subcorpora, we asked informants to evaluate a sample of texts in terms of their perceived usefulness for a translation task. We randomly extracted the URLs of 10 texts chosen among the non-shared ones from each of the subcorpora, corresponding to between 10.5% and 27.8% of the total URLs in the subcorpora, a representative sample according to [33]. They were mixed and presented to respondents in random order, as lists of 20 English and 20 Italian URLs.
The respondents were 5 translation teacher colleagues and 26 (BA/MA) students of translation with varying experience of corpus work, who agreed to participate in the experiment on a voluntary basis. We provided them with a "model" ST, i.e. the PIL discussed in Section 2.3, and asked them to a) open and quickly read through the web pages associated with the URLs; b) rate them according to whether they would include the texts in a corpus they would build for translating the ST (possibilities were: "definitely yes", "probably yes", "probably no", "definitely no"); and c) optionally, provide comments as to the reasons for their decision.
Table 3 shows the absolute number of relevant texts in each subcorpus based on respondent assessment. For a text to count as relevant, more than 50% of the participants had to judge it "definitely" or "probably" appropriate for inclusion in a comparable corpus resource for the simulated translation task. As can be appreciated, the genre-driven corpora performed comparably well – or, in the case of English-G, even better – compared to the ones built following the traditional, topic-driven procedure. The subcorpus with the worst results was English-T, for which only 4 texts out of 10 were deemed useful by the majority of the participants, while its Italian counterpart scored second best (8 "good" texts out of 10).

Table 3. Overall number of relevant texts split by subcorpus.

subcorpus relevant texts
English-G 9
English-T 4
Italian-G 7
Italian-T 8
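The majority criterion just described is simple to operationalize; a minimal sketch (ours, with rating labels matching the questionnaire options):

    def is_relevant(ratings):
        """A text counts as relevant if over half of the ratings are positive."""
        positive = sum(r in ("definitely yes", "probably yes") for r in ratings)
        return positive > len(ratings) / 2

    # e.g. 17 positive ratings out of 31 respondents -> relevant
    print(is_relevant(["definitely yes"] * 17 + ["probably no"] * 14))  # True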
[Figure 1. Positive answers (%) for each text, split by subcorpus.]
If we take a closer look at the results, observing the percentage of positive answers for each text in the 4 subcorpora and the related overall averages (Figure 1), we notice that besides containing the highest number of relevant texts in absolute terms, English-G also has the highest average for positive answers (70.2%), possibly pointing to a high overall quality of the results. The poorer performance of Italian-G can only partly be explained by the lower number of relevant texts (7 vs. 9): in fact its lower average (61.2%) seems to be due to the presence of 3 clearly "wrong" texts, each obtaining less than 20% positive answers, with the remaining 7 texts above the 60% threshold. The scenario is reversed for the topic-driven corpora. The average for Italian-T is very close to that for English-G (68.1%), and much higher than that for English-T (44.8%). The texts in Italian-T display comparable, if not higher, percentages compared to the English genre-driven corpus, but a single text (number 10) lowers the average considerably. Finally, English-T displays the lowest average, resulting from a high number of possibly irrelevant texts, all well below the 50% threshold.
While these results are encouraging in terms of the output quality of the proposed corpus-building procedure, they still do not shed light on whether the genre-driven corpus does indeed contain a higher number of comparable texts at the genre level. The data at our disposal do not allow us to settle the issue here, but the comments provided by the users offer interesting insights into the criteria adopted for judging the relevance of the texts. These include claims about the perceived "authoritativeness" of the websites where the texts are published and the terminological richness of the texts themselves (how many "technical terms" are found), but also, crucially, observations about topic- and genre-comparability, which are mentioned either as explicit criteria (e.g. "this is a patient information leaflet, but deals with a different medicine compared to the ST"), or implicit ones (e.g. "here terminology is probably pertinent"). In this respect it is interesting to notice that comments about all the texts with more than 80% positive answers hint at the fact that the texts are PILs (same genre as the ST), no matter whether they refer to paracetamol medicines or not (same topic as the ST). On the other hand, the highest scoring texts (>90%: 1 in English-T, 2 in Italian-G, 1 in Italian-T) are those where both genre and topic match those of the ST.
More in-depth investigation would be required to evaluate and compare the output of the genre-driven corpus construction pipeline, and to estimate the extent to which genre comparability influences decisions regarding the relevance of texts. However, the results we obtained adopting a relatively straightforward pipeline are encouraging both for English and for Italian: starting from the same manual corpus, and using a genre-driven procedure of seed selection besides the traditional topic-driven one, the size of the resulting corpus is doubled for English, and substantially increased for Italian (cf. Table 2), with comparable levels of perceived relevance. Furthermore, using n-grams as seeds makes seed selection more straightforward, since no reference corpus is required (unlike in the topic-driven pipeline). This is certainly an advantage for translation professionals, who may not have a reference corpus available – or may not even understand the need for one.
Finally, further analysis would be in order to shed light on the reasons for the varying quality and quantity of texts retrieved through the topic- and genre-driven procedures for the two languages. Regarding quality, our tentative explanation is that the reduced relevance of the texts in the English-T subcorpus may be accounted for by the number of function words (e.g. "or", "your", cf. Table 1) that were key in the manual corpus and therefore made it into the seed list – remember that the lists were not manually edited, to avoid introducing subjective biases into the procedure. As for quantity, the Italian subcorpora are consistently smaller than their English counterparts, more noticeably so in the case of the genre-driven subcorpus. In general, we got the impression that, while English PILs are available in different formats on the web, Italian ones are more likely to be in PDF, a format that is currently not supported by the BootCaT frontend we used. Specifically in the case of the Italian-G subcorpus, the n-gram procedure may work less well for languages with a rich morphology: since actual chunks of text are used as exact queries, and these include conjugated verbs, a negative effect on recall may be expected. Of course, in actual practice this can be easily overcome by varying the parameters of seed selection (e.g. shorter n-grams) or increasing the number of queries submitted.
3.4 Using the BootCaT corpus
Going back to the example in Section 2.3, the Italian BootCaT corpus returns 100 occurrences of the word "capsule" (English "capsules"), vs. 0 in the manual corpus and 1,902 in the reference corpus. Browsing the left-hand co-texts (within a span of 10 words preceding the node) we find both "scatola" (2 occurrences) and "confezione" (2 occurrences), but also the more frequent "astuccio" (6 occurrences), not mentioned by the dictionary.
Moving beyond the lexical/terminological level, and considering cross-linguistic genre conventions, this comparable corpus provides evidence about the level of presupposition to be aimed at by the translator that would be difficult to obtain otherwise. The English ST is very reader-friendly, providing explanations of most terms (including rather common names for illnesses and medicines, e.g. "hypertension", "hallucinations", "vasodilators", "antidepressants"). A translator may wonder whether this tendency to explain and define terms is peculiar to this text or a convention of the genre, and whether Italian comparable texts explain domain-specific terminology to a similar degree. The first question can be answered by searching the English subcorpus, the second by searching the comparable Italian subcorpus; both are crucial for deciding how to tackle presupposition. For instance, should a rather obscure medical term like "trombocitopenia" (corresponding to English "thrombocytopenia", and meaning reduced blood platelet count) be accompanied by an explanation in Italian, as it is in the English ST, or not? The manual English subcorpus does not contain the word, suggesting that it might be quite rare in this genre. The automatic one returns 16 occurrences, 7 coming from the topic-driven corpus and 9 from the genre-driven one. None of the occurrences from the topic-driven corpus are accompanied by an explanation, while 7 (out of 9) of those from the genre-driven corpus are, suggesting that the ST is indeed following generic conventions in this case. Browsing both the manual and the automatic Italian corpora for evidence about corresponding generic conventions shows that the term is both more common than its English equivalent (9 occurrences in the manual corpus, 59 in the automatic one) and virtually never accompanied by an explanation. Based on this evidence, a translator should assume that the term does not require an explanation in Italian, and leave it out of the TT.
4 Conclusion
In this chapter we have presented a user's perspective on comparable corpora. We have discussed ways in which these resources can be used for reference purposes in a translation task, and highlighted their positive features with respect to other, better established tools (electronic dictionaries, the web (through search engines), translation memories). Different types of comparable corpora were presented and their use exemplified, i.e. small ad hoc corpora, large web-derived reference corpora, and interactively constructed semi-automatic corpora, which occupy a middle ground between the former two. From the user's point of view, these resources are positioned along two clines. First, in terms of usefulness/quality vs. quantity: manually constructed corpora are very small, reliable and tuned to the task; as we move along the cline to semi-automatic and (web-derived) reference corpora, reliability and specialization decrease while size increases. At the same time, there is a cline in terms of the time and effort required to obtain the corpora – maximum for manual corpora, minimum for reference corpora which are available in the public domain – and the time and effort required for corpus searching and decision-making – minimum for the small corpora whose contents are familiar to the user and have been evaluated prior to inclusion in the corpus, and maximum for the huge reference corpora built by others for a host of different purposes; semi-automatic corpora once again occupy the middle ground between these two opposites.
While we have suggested that these different corpus types do not provide alternatives but rather complementary resources, to be used in different ways and for different purposes during the translation process, we also believe that language service providers and students of translation will only engage with comparable corpora if these provide a positive tradeoff between the time and effort needed to construct and/or (learn to) consult them and their perceived usefulness. Given the reference needs of translators, the midway solution offered by semi-automatic corpora appears to be the most fruitful and the most likely to be taken on board by the profession.
The way forward should therefore be that of developing corpora and methods for constructing corpora that are simple, fast, and flexible, allowing (but not imposing) a degree of user control over corpus contents. We aim to pursue this objective in two main ways in future work. On the one hand, we are experimenting with other methods (besides BootCaT) for constructing specialized web-derived corpora on the fly. In particular, we are trying to tap the potential of Wikipedia as a repository of "virtual" comparable corpora of English and Italian [4]. All the linked entries in these languages have been downloaded, POS-tagged, lemmatized and indexed with the Corpus Workbench [10], forming a comparable corpus which is the sum of several hundred articles on the same topics. Using keywords derived from the human-generated categories accompanying the entries, users can select subsets of texts on the same topics from the two subcorpora, thus obtaining virtual comparable corpora that last for the duration of a search session. On the other hand, we aim to explore further the issues of corpus comparability and quality through user surveys, trying to understand how humans go about the task of selecting web texts in two or more languages for a specialized corpus/specific task. Hopefully some of these strategies can be used to improve semi-automatic corpus building methods; certainly they will help us shed some light on notions such as corpus comparability and adequacy, which have been at the basis of corpus linguistics since its early days.
5 Acknowledgments
We would like to thank the students and colleagues who kindly agreed to evaluate the URLs for us, Claudia Lecci for her expert insights about TM software, Federico Gaspari for fruitful lunchtime discussions on corpus construction strategies, as well as the anonymous reviewer and the editors of the volume for their valuable feedback and suggestions.
References
1. Guy Aston. Corpus use and learning to translate. Textus, 12:289–314, 1999.
2. Marco Baroni and Silvia Bernardini. BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004, pages 1313–1316, Lisbon, 2004. ELDA.
3. Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.
4. Silvia Bernardini, Sara Castagnoli, Adriano Ferraresi, Federico Gaspari, and Eros Zanchetta. Introducing Comparapedia: A new resource for corpus-based translation studies. Paper presented at the International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS 2010), Edge Hill University, Ormskirk, July 2010.
5. Douglas Biber and Susan Conrad. Lexical bundles in conversation and academic
prose. In Hilde Hasselgard and Signe Oksefjell, editors, Out of Corpora: Studies in
Honour of Stig Johansson, pages 181–90. Rodopi, Amsterdam, 1999.
6. Lynne Bowker. Computer-Aided Translation Technology: A Practical Introduction.
University of Ottawa Press, Ottawa, 2002.
7. Lynne Bowker. Examining the impact of corpora on terminographic practice in
the context of translation. In Alet Kruger, Kim Wallmach, and Jeremy Munday,
editors, Corpus-Based Translation Studies. Continuum, London, Forthcoming.
8. Sara Castagnoli. Using the web as a source of LSP corpora in the terminology classroom. In Marco Baroni and Silvia Bernardini, editors, Wacky! Working Papers on the Web as Corpus, pages 159–172. GEDIT, Bologna, 2006.
9. Ziad Chama. From segment focus to context orientation. TC World, 2010. online:
http://www.tcworld.info/index.php?id=167.
10. Oliver Christ. A modular and flexible architecture for an integrated corpus query
system. In Proceedings of COMPLEX 1994, pages 23–32, Budapest, 1994.
11. Anthony P. Cowie, editor. Phraseology: Theory, Analysis, and Applications. Oxford
University Press, Oxford, 2001.
12. Scott A. Crossley and Max M. Louwerse. Multi-dimensional register classification
using bi-grams. International Journal of Corpus Linguistics, 12(4):453–478, 2007.
13. Kevin Crowston and Barbara H. Kwasnik. A framework for creating a facetted classification for genres: Addressing issues of multidimensionality. Hawaii International Conference on System Sciences, 4, 2004. Online: http://doi.ieeecomputersociety.org/10.1109/HICSS.2004.1265268.
14. Alain Désilets, Christiane Melançon, Geneviève Patenaude, and Louise Brunette. How translators use tools and resources to resolve translation problems: An ethnographic study. In Proceedings of MT Summit XII – Workshop: Beyond Translation Memories, Ottawa, 2009.
15. Claudio Fantinuoli. Specialized corpora from the web and term extraction for
simultaneous interpreters. In Marco Baroni and Silvia Bernardini, editors, Wacky!
Working Papers on the Web as Corpus, pages 173–190. GEDIT, Bologna, 2006.
16. Adriano Ferraresi, Silvia Bernardini, Giovanni Picci, and Marco Baroni. Web corpora for bilingual lexicography: A pilot study of English/French collocation extraction and translation. In Richard Xiao, editor, Using Corpora in Contrastive and Translation Studies, pages 337–359. Cambridge Scholars Publishing, Newcastle, 2010.
17. Maristella Gatto. From Body to Web. An Introduction to the Web as Corpus.
Laterza, Bari, 2009.
18. Laura Gavioli. Exploring Corpora for ESP Learning. Benjamins, Amsterdam,
2005.
19. Mohsen Ghadessy, Alex Henry, and Robert L. Roseberry, editors. Small Corpus
Studies and ELT. Benjamins, Amsterdam, 2001.
20. Lorraine Goeuriot, Emmanuel Morin, and Béatrice Daille. Compilation of specialized comparable corpora in French and Japanese. In Proceedings of the ACL-IJCNLP Workshop on Building and Using Comparable Corpora (BUCC 2009), 2009.
21. Stefan Th. Gries and Joybrato Mukherjee. Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics, 15(4):520–548, 2010.
22. Ulrich Heid. Corpus linguistics and lexicography. In Merja Kytö and Anke Lüdeling, editors, Corpus Linguistics – An International Handbook, pages 131–153. Mouton de Gruyter, Berlin and New York, 2008.
23. Michael Hoey. Lexical priming and translation. In Alet Kruger, Kim Wallmach, and
Jeremy Munday, editors, Corpus-Based Translation Studies. Continuum, London,
Forthcoming.
24. MeLLANGE. Corpora and e-learning questionnaire: Results summary – professionals. Internal document, 2006.
25. MultiTrans. MultiTrans 4 (TM): Taking the multilingual textbase approach to new heights. MultiCorpora White Paper, online: http://www.multicorpora.com/filesNVIAdmin/File/MCwhitepaper1.pdf, August 2005.
26. Jeremy Munday. Looming large: A cross-linguistic analysis of semantic prosodies in comparable reference corpora. In Alet Kruger, Kim Wallmach, and Jeremy Munday, editors, Corpus-Based Translation Studies. Continuum, London, Forthcoming.
27. Jennifer Pearson. Terms in Context. Benjamins, Amsterdam, 1998.
28. Jennifer Pearson. Using parallel texts in the translator training environment. In
Federico Zanettin, Silvia Bernardini, and Dominic Stewart, editors, Corpora in
Translator Education, pages 15–24. St Jerome, Manchester, 2003.
29. Gill Philip. Arriving at equivalence: Making a case for comparable general reference corpora in translation studies. In Allison Beeby, Patricia Rodríguez-Inés, and Pilar Sánchez-Gijón, editors, Corpus Use and Translating, pages 59–73. Benjamins, Amsterdam, 2009.
30. Adriane Rinsche and Nadia Portera-Zanotti. Study on the Size of the Language Industry in the EU. European Commission – Directorate General for Translation, Brussels, 2009.
31. Marina Santini. State-of-the-art on automatic genre identification. Technical Report ITRI-04-03, ITRI, University of Brighton (UK), 2004.
32. Luca Serianni. Grammatica Italiana. UTET, Torino, 1991.
33. Serge Sharoff. Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini, editors, Wacky! Working Papers on the Web as Corpus, pages 63–98. GEDIT, Bologna, 2006.
34. John Swales. Genre Analysis: English in Academic and Research Settings. Cambridge University Press, Cambridge, 1990.
35. Krista Varantola. Translators and disposable corpora. In Federico Zanettin, Silvia
Bernardini, and Dominic Stewart, editors, Corpora in Translator Education, pages
55–70. St Jerome, Manchester, 2003.
36. Ian A. Williams. A translator’s reference needs: Dictionaries or parallel texts.
Target, 8:277–299, 1996.
37. Eros Zanchetta. Corpora for the masses: The bootcat front-end. Pecha Kucha
presented at the Corpus Linguistics 2011 Conference, University of Birmingham,
Birmingham, July 2011.
... All'interno dei corpora è possibile anche effettuare una ricerca di keywords, ossia generare una lista di parole tipiche del dominio derivante dal confronto tra la lista di frequenza del corpus specialistico e un corpus di riferimento 5 . Inoltre AntConc permette anche la navigazione "a tutto testo" e offre una serie di ricche informazioni contestuali necessarie per il processo di selezione dei termini (Bernardini, Ferraresi 2013). Se tutte queste operazioni vengono svolte parallelamente in lingua di partenza e in lingua di arrivo, creando di volta in volta i match interlinguistici, sarà molto semplice compilare un glossario bilingue accurato. ...
Chapter
Full-text available
This chapter focuses on the cognitive processes of interpreting, with the intention of providing the student with a knowledge base to become aware of the cognitive dynamics of this activity. In the first part of the chapter, basic notions of brain structures and functions of language are provided, psycholinguistic models of the simultaneous interpreting (SI) process are illustrated - with particular reference to the difference between experienced and novice interpreters -, and the results of functional magnetic resonance imaging studies are described to highlight the effects of constant Si practice on brain areas involved in language processing. The second part of the chapter delves into the executive functions that are essential for performing a cognitively complex multitasking task such as SI, namely working memory (WM), inhibition and cognitive flexibility. After providing basic knowledge on memory and illustrating the role of WM, the concepts of selective attention, attention inhibition and cognitive flexibility are explored. Subsequently, exercises are suggested to enhance these functions in order to develop specific skills. Finally, for those who would like to learn more about the methodology of cognitive research in the field of conference interpreting, a brief review of the most commonly used cognitive tests in empirical research on interpreting is included.
... K-factor metodu bir neçə sözdən ibarət terminlərin avtomatik çıxarılması üçün nəzərdə tutulub və BootCaT sistemində reallaşıb [26]. BootCaT veb-in tematik korpusunun avtomatik formalaşmasına xidmət edir. ...
Article
Məqalədə mətnlərdən terminlərin avtomatik çıxarılmasının beş metodu tədqiq olunmuş və onların müqayisəli analizi verilmişdir. Mətnlərdən terminlərin çıxarılmasının ümumi məqsədi xüsusi sahənin əsas lüğətinin təyin edilməsidir. Terminlərin ənənəvi olaraq əl ilə çıxarılmasından fərqli olaraq avtomatik çıxarılması vaxt aparan bu işi sadələşdirmək üçün kompüterləşdirilmiş bir vasitədir və termin-namizədlərin əvvəlcədən müəyyənləşdirilməsinin avtomatlaşdırılmasına yönəlib. Hazırkı dövrdə bir çox sahələrdə (leksikoqrafiya, terminşünaslıq, informasiya axtarışı və s.) emal olunmalı informasiyanın həcminin artım dinamikası termin və açar sözlərin avtomatik seçilməsi məsələsini xüsusilə aktual edir. Təbii dilin emalı sahəsində qurulan qaydaları təqdim edən mətnlərdən terminlərin avtomatik çıxarılması üçün bir çox fərqli yanaşma və sistem işlənmişdir. Mətnlərdən terminlərin avtomatik çıxarılmasının müxtəlif altməsələləri – korpus kolleksiyası, vahid birliklər, termin və variantların müəyyən olunması və keyfiyyətin qiymətləndirilməsi qaydası təqdim olunmuşdur. Müəyyən predmet sahəsi üçün mətnlərdən terminlərin avtomatik çıxarılmasına tətbiqi yanaşma verilmişdir. "İnformasiya texnologiyaları problemləri" və "İnformasiya cəmiyyəti problemləri" jurnallarının məqalələrinin korpusu üzərində eksperiment aparılmışdır. Ekspert və formal qiymətləndirmə metodikası təklif olunmuş, terminlərin avtomatik çıxarılması metodlarının müqayisəli qiymətləndirilməsinin nəticələri verilmişdir.
... [6] says that the world wide web is not only a tool for information retrieval and exchange but also a massive repository of authentic data, "a self-renewing linguistic resource" offering "a freshness and topicality unmatched by fixed corpora." However, with all the researches that have been done, [7] says that the use of corpus by different language service providers and language professional remains limited due to the existence of computing resources that are likely to be perceived as less demanding regarding time and effort required to obtain them. ...
Article
Full-text available
There has been a great effort in the collection of different languages in the past years all over the world, and the development of online corpus outside the country brought new possibilities in the Philippines. However, there is a limited resource for the Ilokano Language. This paper introduces the Corpus of Spoken Ilokano Language, an online repository of spoken Ilokano in the Philippines specifically in region 1. The main component of this study is spoken Ilokano. It has been specifically built for natural language processing. It shows the difference of Ilokano language as spoken by Ilokanos in the region. The database consists of 160 speakers, 40 speakers in each province of the region, each speaking about 74 statements. Spoken Ilokano language was audio recorded and transcribed. A web application has been developed making the dataset available online. The corpus was validated to provide a useful resource of data that can be used for automatic speech recognition models.
... Moreover, it is conceivable that the extracted texts can be used for other practical applications as well, such as computer-assisted language learning and translation (Delpech, 2014), cross-linguistic translation studies (Bernardini and Ferraresi, 2013) or terminology extraction (Morin et al., 2013). The latter research direction has received particular attention in the past twenty years, as it offered an effort-saving alternative to the manual compilation of dictionaries. 1 Moreover, language professionals show increased interest for automatic technologies, which have the potential to minimize their workload. ...
... Step 3 consists of one or two translation exercises in which students translate sentences using the information gathered from observation of both the comparable corpus and the parallel corpus. The general aim is to help students write natural-sounding, idiomatic translated texts by having them use both "manufactured" and "do-it-yourself" (DIY) corpora (Bernardini & Ferraresi 2013). ...
Conference Paper
Full-text available
Book of abstracts of TALC 2018, the 13th Teaching and Language Corpora Conference, University of Cambridge.
Chapter
This section concerns applications of comparable corpora beyond pure machine translation. It has been argued [1, 2] that downstream applications such as cross-lingual document classification, information retrieval or natural language inference, apart from proving the practical utility of NLP methods ...
Chapter
In a parallel corpus we know by design which document is a translation of which. If the link between documents in different languages is not known, it needs to be established. In this chapter we will discuss methods for measuring document similarity across languages and how to evaluate the results. Then we will proceed to discuss methods for building comparable corpora of different degrees of comparability and for different tasks.
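By way of illustration, one simple way to establish such links is to map documents into a shared bag-of-words space via a bilingual lexicon and compare them with cosine similarity. The Python sketch below uses a toy three-entry Spanish-English lexicon and invented sentences; production systems would instead rely on full lexicons, machine translation or cross-lingual embeddings:

# A minimal sketch of cross-lingual document similarity via a toy lexicon.
import math
import re
from collections import Counter

LEXICON = {"paciente": "patient", "dosis": "dose", "efectos": "effects"}  # toy ES->EN

def to_shared_space(text, lexicon=None):
    tokens = re.findall(r"\w+", text.lower())
    if lexicon:
        # Map source-language tokens onto English where the lexicon allows.
        tokens = [lexicon.get(t, t) for t in tokens]
    return Counter(tokens)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc_en = "The patient should take one dose and watch for side effects."
doc_es = "El paciente debe tomar una dosis y vigilar los efectos secundarios."
print(cosine(to_shared_space(doc_en), to_shared_space(doc_es, LEXICON)))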
Chapter
Full-text available
This collection of studies focuses on the translation of the language of art and cultural heritage in a wide variety of text types from the Renaissance to the present, following different theoretical and methodological approaches ranging from corpus linguistics to lexicography, terminology, and translation studies. This book is meant for a wide audience including scholars and students of languages for special purposes, as well as professional translators and experts in the international communication of cultural heritage. These studies have been carried out as part of the Multilingual Cultural Heritage Lexicon research project (Lessico plurilingue dei Beni Culturali). An initiative that originated at the University of Florence and now involves multiple Italian and international universities, this project is dedicated to compiling textual databases and plurilingual dictionaries through comparable and parallel corpora.
Thesis
Full-text available
This doctoral thesis, entitled "The Use of Electronic Corpora in the Teaching of Specialised Translation: a Theoretical and Practical Approach", focuses on ad hoc specialised corpora, i.e. the compilation of individualised text collections for the purposes of specialised translation using a sophisticated automatic corpus-building tool. More specifically, the aim of the thesis is to investigate the degree to which corpus use and corpus technology are integrated into the translator training programmes offered by Greek higher education institutions, and to explore a user-friendly and effective way of exploiting this source of documentation. The contribution of the thesis is twofold: • It identifies a significant research gap concerning the use of WebBootCat, an automatic corpus-building tool, and the innovative corpus analysis tools of Sketch Engine. • It proposes an educational model based on the combined use of an automatic corpus-building tool (WebBootCat) and a translation memory (such as Trados), thus integrating two modern computing tools supporting the act of translating dynamically and practically into translator training.
Overall, our research shows that corpora and corpus technology are not among the translation tools used in educational environments, and the thesis lays the foundation for the exploitation of a sophisticated tool for automatically constructing corpora from relevant web pages which, in combination with translation memories, can contribute to improving the translator's competences (according to the competence profile developed by the European Master's in Translation) at the linguistic, translational and technological levels.
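For readers unfamiliar with the BootCaT-style procedure that tools such as WebBootCat automate, its first step combines a handful of user-supplied seed terms into random tuples that are submitted as web-search queries; the retrieved pages then become the corpus. The Python sketch below illustrates only this query-generation step, with invented medical seed terms; it does not reproduce WebBootCat's actual code:

# A minimal sketch of BootCaT-style seed-tuple query generation.
# The seed terms are invented examples; retrieval and cleaning of the
# resulting web pages are left to the corpus-building tool.
import itertools
import random

seeds = ["dosage", "contraindications", "adverse reactions",
         "excipients", "posology"]  # hypothetical medical seeds

def make_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    # Quote each term so multi-word seeds are kept together in the query.
    return [" ".join(f'"{term}"' for term in combo)
            for combo in combos[:n_queries]]

for q in make_queries(seeds):
    print(q)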
Article
This article addresses the terminology used in the field of social media, a new field that has emerged from recent technological advances. As social media was first created in an English-speaking context, speakers of other languages have had to develop ways to express the concepts of this field in their own languages. It is thus relevant for translators and journalists to understand the linguistic means by which the transfer of such concepts occurs. To this end, a comparable corpus of newspaper articles on social media (Facebook and WhatsApp) written in English, Spanish and Brazilian Portuguese was compiled, and the procedures used by Brazilian Portuguese and Spanish journalists to convey frequently used English terms were examined. The main translation procedures observed in the study are equivalence, calque, loan and paraphrase. Although most procedures occur in texts written in both Spanish and Brazilian Portuguese, their frequencies of use point to different linguistic preferences in the two languages.
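By way of illustration, one of the four procedures, the direct loan, can be spotted mechanically by checking whether English terms occur verbatim in the Spanish or Portuguese texts. The minimal Python sketch below uses an invented term list and sample sentence; it is not the method used in the article:

# A minimal sketch of detecting direct loans of English terms.
import re

english_terms = ["post", "like", "timeline", "hashtag"]  # hypothetical terms

def find_loans(text, terms):
    # Tokenize the target-language text and keep terms that appear verbatim.
    tokens = set(re.findall(r"\w+", text.lower()))
    return [t for t in terms if t in tokens]

sample_pt = "O usuário publicou um post com uma hashtag popular."
print(find_loans(sample_pt, english_terms))  # -> ['post', 'hashtag']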