The EcoLexicon English Corpus as an Open Corpus in Sketch Engine
Pilar León-Araúz1, Antonio San Martín2, Arianne Reimerink1
1Department of Translation and Interpreting, University of Granada
2Department of Modern Languages and Translation, University of Quebec in Trois-Rivières
E-mail: pleon@ugr.es, antonio.san.martin.pizarro@uqtr.ca, arianne@ugr.es
Abstract
The EcoLexicon English Corpus (EEC) is a 23.1-million-word corpus of contemporary environmental texts. It was compiled by the LexiCon research group for the development of EcoLexicon (Faber, León-Araúz & Reimerink 2016; San Martín et al. 2017), a terminological knowledge base on the environment. It is available as an open corpus in the well-known corpus query system Sketch Engine (Kilgarriff et al. 2014), which means that any user, even without a subscription, can freely access and query the corpus. In this paper, the EEC is introduced by describing how it was built and compiled and how it can be queried and exploited, based both on the functionalities provided by Sketch Engine and on the parameters in which the texts in the EEC are classified.
Keywords: specialized open corpus, terminology, corpus exploitation
1 Introduction
Corpora have become a key element of almost all language studies, as any assertion about language
requires verification through real linguistic data to be deemed credible (Teubert 2005: 1). Having
access to general and specialized corpora is thus essential for anyone involved in research or any
professional activity related to language. However, many of these professionals do not have time to
compile large corpora. The EcoLexicon English Corpus (EEC) is a 23.1-million-word specialized
corpus of contemporary environmental texts. It was compiled by the LexiCon research group for the
development of EcoLexicon (Faber et al. 2016; San Martín et al. 2017), a terminological knowledge
base on the environment.1 In EcoLexicon, the EEC and its Spanish counterpart (together over 50 million words) can be queried with pragmatic restrictions such as author, date of publication, target reader, contextual domain, and keywords. However, its search engine does not provide all the functionalities of the well-known corpus tool Sketch Engine (Kilgarriff et al. 2014). This is why the EEC was made available as an open corpus in Sketch Engine, which means that any user, even without a subscription, can freely access and query the corpus.2 One very interesting module provided by the query system is information extraction through word sketches, which are automatic corpus-derived summaries of a word’s grammatical and collocational behavior (Kilgarriff et al. 2010). Apart from the built-in word sketches, Sketch Engine allows users to customize sketches for their specific needs. In the case of the EEC, this has enhanced the extraction of semantic information.
In this paper, the EEC is introduced by describing how the corpus was built and compiled (Section 2), and how it can be queried and exploited (Section 3), based on the functionalities provided by Sketch Engine, the parameters in which the texts in the EEC are classified and the word sketches exclusively created for the EEC. Finally, Section 4 offers some concluding remarks.
1 EcoLexicon is freely available at <http://ecolexicon.ugr.es>.
2 Certain advanced functionalities are only available for subscribed users.
2 Creating the EcoLexicon English Corpus
The EEC is a 23.1-million-word corpus of contemporary environmental texts. It was first created as an internal tool for knowledge extraction while building EcoLexicon. However, it was made publicly available because it evolved to be a tool in itself that terminologists, translators or even experts could exploit for different purposes (i.e. modeling, comprehension and production tasks) within the
specialized domain of the environment. As Sinclair (1991: 24) pointed out, we should not expect a
general reference corpus like the British National Corpus to adequately document specialized genres
and domains. It follows that we need more specialized corpora, compiled with enough texts and text
types to represent a knowledge domain, as they are more likely to document the conventions of the
genre and the concepts and terms of the domain.
Each text in the EEC is tagged according to a set of XML-based metadata, some of which are based on the Dublin Core Schema, while others have been included to meet the needs of the research group. Corpus metadata permit users to constrain corpus queries based on pragmatic factors, such as environmental domains and target reader. Thus, for instance, the use of the same term in different contexts can be compared. Tags are based on the following main parameters (an illustrative document header is sketched after the list):
Domain: the EEC encompasses all the domains and subdomains of environmental studies (e.g., Biology, Meteorology, Ecology, Environmental Engineering, Environmental Law, etc.).
User: the corpus includes texts for three types of user, depending on level of expertise (i.e., expert, semi-expert, general public).
Geographical variant: it comprises American, British, and Euro English.
Genre: it covers a wide variety of text genres (e.g., journal articles, books, websites, lexicographical material, etc.).
Editor: it distinguishes texts edited by scholars/researchers, businesses, government bodies, etc.
Year: it includes texts from 1973 to 2016.
Country: the texts are tagged according to the country of publication.
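As an illustration of this tagging scheme, a document header along the following lines could encode the parameters above. This is a minimal sketch: the attribute names and values shown here are hypothetical and do not reproduce the actual EEC tag set.

<doc domain="Meteorology" user="expert" variant="British English" genre="journal article"
     editor="scholars/researchers" year="2012" country="United Kingdom">
  <!-- body of the text -->
</doc>

In Sketch Engine, document-level attributes of this kind are exposed as text types, which is what makes the query filtering and subcorpus creation described in Section 3.1 possible.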
The EEC was processed and compiled in an internal application of the research group. Then it was recompiled within Sketch Engine with the Penn Treebank tagset (TreeTagger version 3.3) and with the EcoLexicon Semantic Sketch Grammar (ESSG) (León-Araúz & San Martín 2018; León-Araúz, San Martín & Faber 2016), a CQL-based (Corpus Query Language) (Jakubíček et al. 2010) customized sketch grammar separate from the default sketch grammar. The ESSG was developed for the extraction of semantic word sketches based on some of the most common semantic relations in terminology: generic-specific, part-whole, location, cause, and function.
When a corpus is compiled with a collection of different pattern-based grammar rules such as the above, new word sketches can be queried within the Sketch Engine (see Section 3.2). The ESSG thus has three aims: (1) extracting semantic relations for building EcoLexicon; (2) offering semantic word sketches in the EEC; and (3) providing other users with the possibility of reusing them in their own corpora.3
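By way of illustration, a rule of this kind for the generic-specific relation could look roughly as follows. This is a simplified sketch in the general Sketch Engine sketch grammar format (a *DUAL directive, a pair of relation names, and a CQL pattern with numbered slots), assuming Penn Treebank tags; it is not an actual rule taken from the ESSG, whose relation names and patterns may differ.

*DUAL
=has_type/type_of
# matches patterns such as "nitrogen is a type of pollutant":
# slot 1 is the generic term, slot 2 the specific term
2:[tag="N.*"] [lemma="be"] [tag="DT"]? [lemma="type|kind|class"] [word="of"] 1:[tag="N.*"]

Assuming the usual *DUAL conventions, a match such as "nitrogen is a type of pollutant" would then show pollutant in the type_of column of the word sketch of nitrogen, and nitrogen in the has_type column of the word sketch of pollutant, which is the kind of semantic word sketch illustrated in Section 3.2.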
3 Exploiting the EcoLexicon English Corpus
The combination of pragmatic, syntactic and semantic information that can be extracted from the corpus makes the EEC an adequate resource for all kinds of end users with an interest in environmental science, such as domain experts, professional writers, translators, terminologists, ESP researchers, etc., as stated above. Thanks to Sketch Engine’s automation capabilities, users are able to analyze and extract a sizable quantity of linguistic data that would have been unmanageable in the past (Kosem et al. 2014: 362). In the following sections, different queries will be provided by combining the main functionalities of Sketch Engine with the parameters according to which the EEC is tagged.4
3 The latest version of the ESSG can be downloaded from <http://ecolexicon.ugr.es/essg/>.
3.1 Search and Text Types
The feature Search is the main way to access concordances in Sketch Engine. Different types of queries are possible (simple, lemma, phrase, word, character and CQL), and they can be combined with the contextual filter, which allows the user to limit the lemmas that should appear around the word or words of the query. Additionally, in the case of the EEC, any query performed through the Search feature can be filtered according to text type based on the tagging of the EEC (domain, genre, editor, etc.) (Figure 1).
The filtering by text type can be chosen manually for each query. However, the user can also create subcorpora based on text types. For instance, a user may want to create a simple subcorpus for the domains of Hydrology or Renewable Energy, or complex subcorpora, such as one containing only articles and books in British English from the domain of Biology for experts in the field. Additionally, the EEC comes with several subcorpora created by default (i.e. American English, British English, Year 1973–1999, Year 2000–2009 and Year 2010–2016).
Figure 1: Sketch Engine’s Search and EEC Text types.
All these possibilities of query customization allow the user to retrieve, for instance, all the concordances where recycle is a verb in texts addressed to the general public (lemma search filtered by user) or where climate change occurs in Environmental Law texts (phrase search filtered by domain). Additionally, the Context option can be combined with any search, permitting the user to find, for example, all the concordances in Oceanography academic articles where the lemma wind appears in a window of ±15 tokens of the lemma wave.
However, given that the EEC was recompiled with TreeTagger, it is possible to perform more fine-grained queries in CQL, allowing for the formalization of grammar patterns in the form of regular expressions combined with POS-tags. CQL queries used together with text-type filtering are a powerful tool to research the workings of environmental English. An example of a CQL query is ([tag="N.*"] [lemma="amount" & tag="N.*"]) | ([lemma="amount" & tag="N.*"] [word="of"] [tag="N.*"]), which finds concordances of the lemma amount either preceded by any noun or followed by of and any noun. Figure 2 shows a sample of the resulting concordances limited to the Meteorology subdomain.
4 Due to space restrictions, no instructions are provided. However, interested readers can consult the user-friendly Sketch Engine manual at <http://sketchengine.co.uk/user-guide/>.
Figure 2: Sample of the results for the CQL query amount preceded by a noun or followed by of and any
noun in the Meteorology subdomain.
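The window search performed above with the Context option can also be approximated directly in CQL. The following query is an illustrative reformulation (not the query used above): it retrieves concordances where the lemmas wind and wave occur within roughly 15 tokens of each other, in either order, and it can likewise be restricted to Oceanography journal articles with the text-type filters described earlier.

([lemma="wind"] []{0,14} [lemma="wave"]) | ([lemma="wave"] []{0,14} [lemma="wind"])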
With CQL queries, a user can also compare the frequency of different variants of multiword expressions. For example, in the term geologic time scale, geologic can be replaced by geological and time scale can be written as a single word. With the CQL query [lemma="geologic.*"] ([lemma="timescale"]|([lemma="time"] [lemma="scale"])) we can retrieve all the concordances where all the variants appear, and with the Frequency – Node forms feature we can see which form is more frequent (Figure 3).
Figure 3: Frequency of variants of geologic time scale in the EEC.
Another feature of Sketch Engine that permits users to fully exploit the EEC is Frequency – Text type. With this feature, users can observe how language expression changes across different levels of expertise in the environmental domain. For instance, when searching for the verb liquefy, concordances can be filtered according to the user type parameter. Not surprisingly, the verb appears more often in expert-related texts than in texts addressed to the general public (Figure 4).
Figure 4: Frequency of liquefy in the EEC according to user type.
With this feature the frequency of terms in different domains can also be observed, thus verifying whether a term is more specific to one domain or another. For instance, by searching for the lemma photovoltaic and looking up its frequency according to domain, the results show that it is a term mainly linked to the domain of Renewable Energy, although it also occurs, but with much lower frequency, in Climatology and Air Quality Management (Figure 5).
Figure 5: Frequency of photovoltaic in the EEC according to environmental subdomain.
3.2 Word Sketch and Sketch Diff
The EEC employs both the default sketch grammar for English underlying the word sketches in the tool and the ESSG. Users can benefit from Sketch Engine’s default word sketches when searching for the collocations that are most often used in specialized discourse in combination with a certain term. For instance, Figure 6 shows the modifiers of methane, the nouns modified by methane, and the verbs that collocate with methane both as object and subject.
Figure 6: Word sketches of methane extracted from the EEC.
Thanks to the ESSG, users can access ready-made semantic word sketches such as those shown in
Figure 7, where search terms may appear related to their hyponyms (i.e. microorganism), the whole
they are part of (i.e. oxygen), their underlying causes (i.e. tsunami), etc.
Figure 7: Semantic word sketches of methane extracted from the EEC.
The word sketch queries can be complemented with the text type filters provided by the tags of the EEC (or subcorpora based on them). In this sense, users can also observe how concepts can change their relational behavior across different environmental subdomains. For example, Figure 8 shows how nitrogen is mainly categorized as a type of pollutant in the domain of Air Quality Management and as a type of nutrient in that of Biology.
Figure 8: Nitrogen generic-specific semantic word sketches in Air Quality Management (left) and Biology (right) subcorpora.
Additionally, if users access the concordances extracted with the ESSG, they can extract knowledge-rich contexts (i.e. contexts containing domain knowledge potentially useful for conceptual analysis (Meyer 2001)) like the ones in Table 1.
Table 1: Sample of knowledge-rich contexts extracted from the EEC with the aid of the ESSG.
generic-specific:
  A hydrograph is a graph that reflects the discharge of a river over a period of time.
  The astronomical tide refers to the regular oscillations of the sea or ocean surface […].
part-whole:
  Sand grains usually consist of quartz but may also be fragments of feldspar, mica, and […].
  Seawater contains sodium chloride and other salts in concentrations three times greater […].
location:
  Lagoons commonly form on coastlines that are subsiding, or where sea level is rising.
  Most ozone is found in the stratosphere at elevations between 10 and 50 kilometers […].
cause:
  […] the human costs of malaria outweigh the environmental damage caused by the use of DDT.
  Logging may also contribute to deforestation by making it easier for agriculture to […].
function:
  Membrane-assisted BAC is used for the removal of priority pollutants from secondary […].
  Liquid-in-glass thermometers are often used for measuring surface air temperature because […].
Another word-sketch-based feature that can be especially exploited with the EEC is Sketch Diff. It allows the user to compare the word sketches of two lemmas, the word sketches of the same lemma in two subcorpora, or two different word forms of the same lemma. Figure 9 shows an example of each type. On the left, the modifiers of risk (in green) and hazard (in red) in the whole EEC are contrasted. As can be observed, these two semantically related terms tend to co-occur with different modifiers, although they also share some of them (in white). In the center, there is a sketch diff that shows how water takes different verbs as an object in Hydrology (in green) and Water Treatment and Supply (in red), as well as a considerable number of shared results. Finally, the sketch diff on the right outlines the verbs that tend to have gas as subject in the singular (in green) and in the plural (in red) in the whole EEC.
Figure 9: Sample of sketch diffs extracted from the EEC.
3.3 Word List
The Word list feature can be used to extract frequency lists with many different settings, including n-gram extraction, filtering based on regular expressions or keyword extraction with the aid of a user-chosen reference corpus. This feature can be used in combination with an EEC subcorpus, which allows the user to generate very specific frequency lists. Some examples of frequency lists that could be useful to generate from the EEC are: nouns specific to Energy Engineering academic texts using the British National Corpus as a reference; most common 4-grams in Zoology texts; adjectives containing -friendly in the whole EEC; or the most common verbs in Geology texts (Figure 10).
Figure 10: Frequency list of verbs in Geology texts.
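The regular-expression support mentioned above can also be combined with POS restrictions in a concordance query. For instance, the -friendly example could be approached with the following illustrative CQL query (a sketch, assuming the Penn Treebank adjective tags used in the EEC), whose hits can then be summarized with the Frequency – Node forms feature:

[lemma=".*-friendly" & tag="JJ.*"]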
4 Conclusion
In this paper, we have shown how the EEC was built and compiled and how it can be queried and exploited in Sketch Engine. The EEC’s metadata, the default sketch grammar and the ESSG make the EEC a useful resource for any user interested in environmental science. As future work, we will refine, improve and update the ESSG and develop new rules for Spanish. Furthermore, in the short term, we plan to upload an improved version of the EEC (with more words and some minor codification issues solved) and a first version of the Spanish counterpart. In the long run, we will enhance the EEC with a new annotated version, where different semantic tags will be added to improve its querying potential. These semantic tags will include semantic categories and argument structure.
Sketch Engine’s API also allows for the exploitation of the EEC from external applications. An example of this is EcoLexiCAT, a terminology-enhanced computer-assisted translation (CAT) tool that provides easy access to domain-specific terminological knowledge in context (León-Araúz & Reimerink 2018; León-Araúz, Reimerink & Faber 2017). EcoLexiCAT integrates different features of the professional translation workflow in a stand-alone interface where a source text is interactively enriched with terminological information (i.e., definitions, translations, images, compound terms, corpus access, etc.) from EcoLexicon, BabelNet, IATE, and Sketch Engine. In the Sketch Engine module of EcoLexiCAT’s interface, terms from both the source and target segments can be selected and direct access is given to concordances, CQL queries and word sketches of the selected terms. For a more detailed analysis, the output of the queries can be opened in a new tab that sends users to the website of the Sketch Engine Open Corpora.
References
Faber, P., León-Araúz, P. & Reimerink, A. (2016). EcoLexicon: New Features and Challenges. In GLOBALEX 2016: Lexicographic Resources for Human Language Technology in conjunction with LREC 2016, pp. 73–80.
Jakubíček, M., Kilgarriff, A., McCarthy, D. & Rychlý, P. (2010). Fast syntactic searching in very large corpora for many languages. In Proceedings of the PACLIC 24, pp. 741–747.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P. & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), pp. 7–36. http://doi.org/10.1007/s40607-014-0009-9
Kilgarriff, A., Kovár, V., Krek, S., Srdanovic, I. & Tiberius, C. (2010). A Quantitative Evaluation of Word Sketches. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the 14th EURALEX International Congress, pp. 372–379. Leeuwarden/Ljouwert, The Netherlands: Fryske Akademy.
Kosem, I., Gantar, P., Logar, N. & Krek, S. (2014). Automation of Lexicographic Work Using General and Specialized Corpora: Two Case Studies. In A. Abel, C. Vettori & N. Ralli (Eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus, pp. 355–364. Bolzano/Bozen: Institute for Specialised Communication and Multilingualism.
León-Araúz, P. & Reimerink, A. (2018). Evaluating EcoLexiCAT: a Terminology-Enhanced CAT Tool. In Proceedings of the 11th International Language Resources and Evaluation Conference (LREC 2018). Miyazaki: ELRA.
León-Araúz, P., Reimerink, A. & Faber, P. (2017). EcoLexiCAT: a Terminology-enhanced Translation Tool for Texts on the Environment. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek & V. Baisa (Eds.), Electronic lexicography in the 21st century. Proceedings of eLex 2017 conference, pp. 321–341. Leiden: Lexical Computing.
León-Araúz, P. & San Martín, A. (2018). The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches. In I. Kerneman & S. Krek (Eds.), Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography & WordNets”, pp. 94–99. Miyazaki: Globalex.
León-Araúz, P., San Martín, A. & Faber, P. (2016). Pattern-based Word Sketches for the Extraction of Semantic Relations. In Proceedings of the 5th International Workshop on Computational Terminology, pp. 73–82. Osaka.
Meyer, I. (2001). Extracting knowledge-rich contexts for terminography. In D. Bourigault, C. Jacquemin & M.-C. L’Homme (Eds.), Recent advances in computational terminology, pp. 279–302. Amsterdam/Philadelphia: John Benjamins.
San Martín, A., Cabezas-García, M., Buendía Castro, M., Sánchez Cárdenas, B., León-Araúz, P. & Faber, P. (2017). Recent Advances in EcoLexicon. Dictionaries: Journal of the Dictionary Society of North America, 38(1), pp. 96–115.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), pp. 1–13. http://doi.org/10.1075/ijcl.10.1.01teu
Acknowledgements
This research was carried out as part of projects FF2014-52740-P, Cognitive and Neurological Bases for Terminology-enhanced Translation (CONTENT), and FFI2017-89127-P, Translation-oriented Terminology Tools for Environmental Texts (TOTEM), funded by the Spanish Ministry of Economy and Competitiveness.