
The EcoLexicon English Corpus as an Open Corpus in Sketch Engine

July 2018


Conference: XVIII EURALEX International Congress: Lexicography in Global Contexts
At: Ljubljana

Authors:

Pilar León Araúz

University of Granada

Antonio San Martín

Université du Québec à Trois-Rivières

Arianne Reimerink

University of Granada




Pilar León-Araúz¹, Antonio San Martín², Arianne Reimerink¹

¹ Department of Translation and Interpreting, University of Granada

² Department of Modern Languages and Translation, University of Quebec in Trois-Rivières

E-mail: pleon@ugr.es, antonio.san.martin.pizarro@uqtr.ca, arianne@ugr.es

Abstract

The EcoLexicon English Corpus (EEC) is a 23.1-million-word corpus of contemporary environmental texts. It was compiled by the LexiCon research group for the development of EcoLexicon (Faber, León-Araúz & Reimerink 2016; San Martín et al. 2017), a terminological knowledge base on the environment. It is available as an open corpus in the well-known corpus query system Sketch Engine (Kilgarriff et al. 2014), which means that any user, even without a subscription, can freely access and query the corpus. In this paper, the EEC is introduced by describing how it was built and compiled and how it can be queried and exploited, based both on the functionalities provided by Sketch Engine and on the parameters according to which the texts in the EEC are classified.

Keywords: specialized open corpus, terminology, corpus exploitation

1 Introduction

Corpora have become a key element of almost all language studies, as any assertion about language requires verification through real linguistic data to be deemed credible (Teubert 2005: 1). Having access to general and specialized corpora is thus essential for anyone involved in research or any professional activity related to language. However, many of these professionals do not have time to compile large corpora. The EcoLexicon English Corpus (EEC) is a 23.1-million-word specialized corpus of contemporary environmental texts. It was compiled by the LexiCon research group for the development of EcoLexicon (Faber et al. 2016; San Martín et al. 2017), a terminological knowledge base on the environment.¹ In EcoLexicon, the EEC and its Spanish counterpart (together over 50 million words) can be queried with pragmatic restrictions such as author, date of publication, target reader, contextual domain, and keywords. However, its search engine does not provide all the functionalities of the well-known corpus tool Sketch Engine (Kilgarriff et al. 2014). This is why the EEC was made available as an open corpus in Sketch Engine, which means that any user, even without a subscription, can freely access and query the corpus.² One particularly useful module provided by the query system is information extraction through word sketches, which are automatic corpus-derived summaries of a word's grammatical and collocational behavior (Kilgarriff et al. 2010). Apart from the built-in word sketches, Sketch Engine allows users to customize sketches for their specific needs. In the case of the EEC, this has enhanced the extraction of semantic information.

In this paper, the EEC is introduced by describing how the corpus was built and compiled (Section 2) and how it can be queried and exploited (Section 3), based on the functionalities provided by Sketch Engine, the parameters according to which the texts in the EEC are classified, and the word sketches created exclusively for the EEC. Finally, Section 4 offers some concluding remarks.

¹ EcoLexicon is freely available at <http://ecolexicon.ugr.es>.

² Certain advanced functionalities are only available to subscribed users.


2 Creating the EcoLexicon English Corpus

The EEC is a 23.1-million-word corpus of contemporary environmental texts. It was first created as an internal tool for knowledge extraction while building EcoLexicon. However, it was made publicly available because it evolved into a tool in itself that terminologists, translators or even experts could exploit for different purposes (i.e. modeling, comprehension and production tasks) within the specialized domain of the environment. As Sinclair (1991: 24) pointed out, we should not expect a general reference corpus like the British National Corpus to adequately document specialized genres and domains. It follows that we need more specialized corpora, compiled with enough texts and text types to represent a knowledge domain, as they are more likely to document the conventions of the genre and the concepts and terms of the domain.

Each text in the EEC is tagged with a set of XML-based metadata, some of which are based on the Dublin Core Schema, while others have been included to meet the needs of the research group. Corpus metadata permit users to constrain corpus queries based on pragmatic factors, such as environmental domain and target reader. Thus, for instance, the use of the same term in different contexts can be compared. Tags are based on the following main parameters:

• Domain: the EEC encompasses all the domains and subdomains of environmental studies (e.g., Biology, Meteorology, Ecology, Environmental Engineering, Environmental Law, etc.).

• User: the corpus includes texts for three types of user, depending on level of expertise (i.e., expert, semi-expert, general public).

• Geographical variant: it comprises American, British, and Euro English.

• Genre: it covers a wide variety of text genres (e.g., journal articles, books, websites, lexicographical material, etc.).

• Editor: it distinguishes texts edited by scholars/researchers, businesses, government bodies, etc.

• Year: it includes texts from 1973 to 2016.

• Country: the texts are tagged according to the country of publication.
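To make the role of these parameters concrete, the following minimal Python sketch models a text's metadata as a record and filters a collection by any combination of tags. It is only an illustration: the `TextRecord` class, the `select` helper and the three sample records are hypothetical and do not reproduce the EEC's actual XML schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical stand-in for one EEC text and its metadata tags."""
    domain: str
    user: str      # "expert", "semi-expert" or "general public"
    variant: str   # "American", "British" or "Euro"
    genre: str
    editor: str
    year: int
    country: str


def select(texts, **criteria):
    """Return the texts whose metadata equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in criteria.items())]


# Invented sample records, for illustration only.
texts = [
    TextRecord("Biology", "expert", "British", "journal article",
               "scholars/researchers", 2008, "United Kingdom"),
    TextRecord("Biology", "general public", "American", "website",
               "government bodies", 2014, "United States"),
    TextRecord("Meteorology", "expert", "American", "book",
               "scholars/researchers", 1995, "United States"),
]

# Texts from the Biology domain written for experts.
expert_biology = select(texts, domain="Biology", user="expert")
```

Combining several criteria in one call mirrors the way pragmatic restrictions narrow a corpus query to a particular slice of the corpus.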

The EEC was processed and compiled in an internal application of the research group. It was then recompiled within Sketch Engine with the Penn Treebank tagset (TreeTagger version 3.3) and with the EcoLexicon Semantic Sketch Grammar (ESSG) (León-Araúz & San Martín 2018; León-Araúz, San Martín & Faber 2016), a customized sketch grammar written in CQL (Corpus Query Language) (Jakubíček et al. 2010) and separate from the default sketch grammar. The ESSG was developed for the extraction of semantic word sketches based on some of the most common semantic relations in terminology: generic-specific, part-whole, location, cause, and function.

When a corpus is compiled with a collection of different pattern-based grammar rules such as the above, new word sketches can be queried within Sketch Engine (see Section 3.2). The ESSG thus has three aims: (1) extracting semantic relations for building EcoLexicon; (2) offering semantic word sketches in the EEC; and (3) providing other users with the possibility of reusing them in their own corpora.³

³ The latest version of the ESSG can be downloaded from <http://ecolexicon.ugr.es/essg/>.

3 Exploiting the EcoLexicon English Corpus

The combination of pragmatic, syntactic and semantic information that can be extracted from the corpus makes the EEC an adequate resource for all kinds of end users with an interest in environmental science, such as domain experts, professional writers, translators, terminologists, ESP researchers, etc., as stated above. Thanks to Sketch Engine's automation capabilities, users are able to analyze and extract a sizable quantity of linguistic data that would have been unmanageable in the past (Kosem et al. 2014: 362). In the following sections, different queries will be provided by combining the main functionalities of Sketch Engine with the parameters according to which the EEC is tagged.⁴

3.1 Search and Text Types

The Search feature is the main way to access concordances in Sketch Engine. Different types of queries are possible (simple, lemma, phrase, word, character and CQL), and they can be combined with the contextual filter, which allows the user to limit the lemmas that should appear around the word or words of the query. Additionally, in the case of the EEC, any query performed through the Search feature can be filtered according to text type based on the tagging of the EEC (domain, genre, editor, etc.) (Figure 1).

The filtering by text type can be chosen manually for each query. However, the user can also create subcorpora based on text types. For instance, a user may want to create a simple subcorpus for the domains of Hydrology or Renewable Energy, or a complex subcorpus, such as one containing only articles and books in British English from the domain of Biology for experts in the field. Additionally, the EEC comes with several subcorpora created by default (i.e. American English, British English, Year 1973–1999, Year 2000–2009 and Year 2010–2016).

Figure 1: Sketch Engine’s Search and EEC Text types.

All these possibilities of query customization allow the user to retrieve, for instance, all the concordances where recycle is a verb in texts addressed to the general public (lemma search filtered by user) or where climate change occurs in Environmental Law texts (phrase search filtered by domain). Additionally, the Context option can be combined with any search, permitting the user to find, for example, all the concordances in Oceanography academic articles where the lemma wind appears in a window of ±15 tokens of the lemma wave.
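The logic of such a context window can be sketched in a few lines of Python. This is not how Sketch Engine implements the Context option; the `cooccurrences` function and the sample sentence are hypothetical, and the input is assumed to be an already lemmatized token list.

```python
def cooccurrences(lemmas, node, collocate, window=15):
    """Return positions of `node` that have `collocate` within ±window tokens."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != node:
            continue
        left = max(0, i - window)
        # Tokens to the left and to the right of the node, excluding the node.
        context = lemmas[left:i] + lemmas[i + 1:i + 1 + window]
        if collocate in context:
            hits.append(i)
    return hits


# Invented, pre-lemmatized example sentence.
lemmas = "the wind blow across the sea and generate a wave".split()
positions = cooccurrences(lemmas, node="wind", collocate="wave")
```

Narrowing the window (for example, `window=3`) excludes this hit, which shows how the window size controls the trade-off between recall and relevance.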

Because the EEC was recompiled with TreeTagger, it is also possible to perform more fine-grained queries in CQL, which allows grammar patterns to be formalized as regular expressions combined with POS tags. CQL queries used together with text-type filtering are a powerful tool for researching the workings of environmental English. An example of a CQL query is ([tag="N.*"] [lemma="amount" & tag="N.*"]) | ([lemma="amount" & tag="N.*"] [word="of"] [tag="N.*"]), which finds concordances of the noun amount either preceded by any noun or followed by of and any noun. Figure 2 shows a sample of the resulting concordances limited to the Meteorology subdomain.

⁴ Due to space restrictions, no instructions are provided. However, interested readers can consult the user-friendly Sketch Engine manual at <http://sketchengine.co.uk/user-guide/>.


Figure 2: Sample of the results for the CQL query amount preceded by a noun or followed by of and any noun in the Meteorology subdomain.
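For readers unfamiliar with CQL, the two branches of the amount query can be emulated over a POS-tagged token list. The Python sketch below is an approximation under stated assumptions: tokens are (word, lemma, tag) triples with Penn Treebank-style tags, the function names are hypothetical, and Sketch Engine's own matching engine is not reproduced.

```python
import re


def is_noun(tag):
    """Emulate the CQL condition tag="N.*"."""
    return re.fullmatch(r"N.*", tag) is not None


def is_amount_noun(token):
    """Emulate [lemma="amount" & tag="N.*"]."""
    _word, lemma, tag = token
    return lemma == "amount" and is_noun(tag)


def find_amount_patterns(tokens):
    """Yield (start, end) spans matching either branch of the query."""
    for i in range(len(tokens)):
        # Branch 1: [tag="N.*"] [lemma="amount" & tag="N.*"]
        if (i + 1 < len(tokens) and is_noun(tokens[i][2])
                and is_amount_noun(tokens[i + 1])):
            yield (i, i + 2)
        # Branch 2: [lemma="amount" & tag="N.*"] [word="of"] [tag="N.*"]
        if (i + 2 < len(tokens) and is_amount_noun(tokens[i])
                and tokens[i + 1][0] == "of" and is_noun(tokens[i + 2][2])):
            yield (i, i + 3)


# Invented, hand-tagged tokens: (word, lemma, tag).
tokens = [
    ("rainfall", "rainfall", "NN"), ("amounts", "amount", "NNS"),
    ("vary", "vary", "VBP"), ("the", "the", "DT"),
    ("amount", "amount", "NN"), ("of", "of", "IN"),
    ("precipitation", "precipitation", "NN"),
]
spans = list(find_amount_patterns(tokens))
```

The tag condition on amount matters because the same lemma can also be a verb (e.g. "this amounts to"), which the query deliberately excludes.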

With CQL queries, a user can also compare the frequency of different variants of multiword expressions. For example, in the term geologic time scale, geologic can be replaced by geological and time scale can be written as a single word. With the CQL query [lemma="geologic.*"] ([lemma="timescale"]|([lemma="time"] [lemma="scale"])) we can retrieve all the concordances where any of the variants appears, and with the Frequency – Node forms feature we can see which form is more frequent (Figure 3).

Figure 3: Frequency of variants of geologic time scale in the EEC.
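The same kind of variant comparison can be approximated on raw text with a regular expression and a frequency counter. The following Python sketch is a simplification: it works on surface strings rather than lemmas, it only accepts geologic and geological (whereas the CQL pattern geologic.* is broader), and the sample text is invented.

```python
import re
from collections import Counter

# geologic/geological + timescale(s) or time scale(s), case-insensitive.
VARIANT = re.compile(
    r"\bgeologic(?:al)?\s+(?:timescales?|time\s+scales?)\b", re.IGNORECASE
)


def variant_frequencies(text):
    """Count each surface form, normalized to lower case and single spaces."""
    forms = (" ".join(m.group(0).lower().split())
             for m in VARIANT.finditer(text))
    return Counter(forms)


# Invented sample text, for illustration only.
sample = ("The geologic time scale divides Earth history. "
          "On the geological timescale, ice ages are brief. "
          "The Geologic Time Scale is revised regularly.")
counts = variant_frequencies(sample)
```

Calling `counts.most_common()` then gives a ranked list comparable in spirit to the node-form frequencies, though without lemmatization or POS filtering.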

Another feature of Sketch Engine that permits users to fully exploit the EEC is Frequency – Text type. With this feature, users can observe how language use changes across different levels of expertise in the environmental domain. For instance, when searching for the verb liquefy, concordances can be filtered according to the user type parameter. Not surprisingly, the verb appears more often in expert-related texts than in texts addressed to the general public (Figure 4).

Figure 4: Frequency of liquefy in the EEC according to user type.
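When frequencies are compared across text types, raw counts are misleading if the subcorpora differ in size, so they are usually normalized. The short Python sketch below shows a per-million normalization; the hit counts and subcorpus sizes are invented for illustration and are not the EEC's real figures, and the exact statistics reported by Sketch Engine are not reproduced here.

```python
def per_million(hits, subcorpus_tokens):
    """Normalize a raw hit count to occurrences per million tokens."""
    return hits * 1_000_000 / subcorpus_tokens


# Invented figures for illustration; they are NOT the EEC's real counts.
hits = {"expert": 40, "semi-expert": 10, "general public": 1}
sizes = {"expert": 4_000_000, "semi-expert": 2_000_000,
         "general public": 1_000_000}

relative = {user: per_million(hits[user], sizes[user]) for user in hits}
```

With these toy numbers, the expert subcorpus has the highest normalized frequency even after its larger size is taken into account.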


With the Frequency – Text type feature, the frequency of terms in different domains can also be observed, which helps verify whether a term is more specific to one domain or another. For instance, searching for the lemma photovoltaic and looking up its frequency according to domain shows that it is a term mainly linked to the domain of Renewable Energy, although it also occurs, with much lower frequency, in Climatology and Air Quality Management (Figure 5).

Figure 5: Frequency of photovoltaic in the EEC according to environmental subdomain.

3.2 Word Sketch and Sketch Diff

The EEC employs both the default sketch grammar for English, which underlies the word sketches in the tool, and the ESSG. Users can benefit from Sketch Engine's default word sketches when searching for the collocations that are used most often in specialized discourse in combination with a certain term. For instance, Figure 6 shows the modifiers of methane, the nouns modified by methane and the verbs that collocate with methane both as object and as subject.

Figure 6: Word sketches of methane extracted from the EEC.

Thanks to the ESSG, users can access ready-made semantic word sketches such as those shown in Figure 7, where search terms may appear related to their hyponyms (e.g. microorganism), the whole they are part of (e.g. oxygen), their underlying causes (e.g. tsunami), etc.

Figure 7: Semantic word sketches of methane extracted from the EEC.

The word sketch queries can be complemented with the text type filters provided by the tags of the EEC (or subcorpora based on them). In this way, users can also observe how concepts change their relational behavior across different environmental subdomains. For example, Figure 8 shows how nitrogen is mainly categorized as a type of pollutant in the domain of Air Quality Management and as a type of nutrient in that of Biology.


Figure 8: Nitrogen generic-specific semantic word sketches in Air Quality Management (left) and Biology (right) subcorpora.

Additionally, if users access the concordances extracted with the ESSG, they can extract knowledge-rich contexts (i.e. contexts containing domain knowledge potentially useful for conceptual analysis (Meyer 2001)) like the ones in Table 1.

Table 1: Sample of knowledge-rich contexts extracted from the EEC with the aid of the ESSG.

generic-specific
• A hydrograph is a graph that reflects the discharge of a river over a period of time.
• The astronomical tide refers to the regular oscillations of the sea or ocean surface […].

part-whole
• Sand grains usually consist of quartz but may also be fragments of feldspar, mica, and, […].
• Seawater contains sodium chloride and other salts in concentrations three times greater […].

location
• Lagoons commonly form on coastlines that are subsiding, or where sea level is rising.
• Most ozone is found in the stratosphere at elevations between 10 and 50 kilometers […].

cause
• […] the human costs of malaria outweigh the environmental damage caused by the use of DDT.
• Logging may also contribute to deforestation by making it easier for agriculture to […].

function
• Membrane-assisted BAC is used for the removal of priority pollutants from secondary […].
• Liquid-in-glass thermometers are often used for measuring surface air temperature because […].
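The idea of a pattern-based rule behind such contexts can be illustrated with a deliberately simplified example. The Python sketch below uses a single regular expression over raw text for one generic-specific pattern ("X is a Y that/which ..."). It is a hypothetical illustration only: the ESSG itself consists of CQL rules over POS-tagged text, and neither the pattern nor the function name below is taken from it.

```python
import re

# Hypothetical, highly simplified pattern: "A/An/The X is a/an Y that|which ..."
GENERIC_SPECIFIC = re.compile(
    r"^(?:An?|The)\s+(?P<hyponym>[\w\s-]+?)\s+is\s+an?\s+"
    r"(?P<hypernym>[\w-]+)\s+(?:that|which)\b",
    re.IGNORECASE,
)


def extract_generic_specific(sentence):
    """Return (hyponym, hypernym) if the sentence fits the pattern, else None."""
    match = GENERIC_SPECIFIC.search(sentence)
    if match is None:
        return None
    return match.group("hyponym"), match.group("hypernym")


pair = extract_generic_specific(
    "A hydrograph is a graph that reflects the discharge of a river "
    "over a period of time."
)
```

A surface pattern like this misses many valid contexts (for instance, "The astronomical tide refers to ...") and would need POS information to avoid false positives, which is one motivation for writing such rules over tagged corpora instead.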

Another word sketch-based feature that can be especially exploited with the EEC is Sketch diff. It allows the user to compare the word sketches of two lemmas, of the same lemma in two subcorpora, or of two different word forms of the same lemma. Figure 9 shows an example of each type. On the left, the modifiers of risk (in green) and hazard (in red) in the whole EEC are contrasted. As can be observed, these two semantically related terms tend to co-occur with different modifiers, although they also share some of them (in white). In the center, a sketch diff shows how water takes different verbs as an object in Hydrology (in green) and Water Treatment and Supply (in red), as well as a considerable number of shared results. Finally, the sketch diff on the right outlines the verbs that tend to have gas as subject in the singular (in green) and in the plural (in red) in the whole EEC.

Figure 9: Sample of sketch diffs extracted from the EEC.
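Conceptually, a comparison of this kind separates the collocates that are exclusive to each item from those they share. The Python sketch below shows only that set logic; the `sketch_diff` function and the modifier counts are invented for illustration, and the association scores and color gradients that Sketch Engine computes are not modeled.

```python
def sketch_diff(collocates_a, collocates_b):
    """Split collocates into those exclusive to A, shared, and exclusive to B."""
    a, b = set(collocates_a), set(collocates_b)
    return {
        "only_a": sorted(a - b),
        "shared": sorted(a & b),
        "only_b": sorted(b - a),
    }


# Invented modifier counts, for illustration only; not taken from the EEC.
risk_modifiers = {"flood": 40, "health": 25, "potential": 12}
hazard_modifiers = {"natural": 50, "health": 8, "potential": 9}

diff = sketch_diff(risk_modifiers, hazard_modifiers)
```

The same function could be applied to the collocates of one lemma in two subcorpora, which corresponds to the second type of comparison described above.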

3.3 Word List

The Word list feature can be used to extract frequency lists with many different settings, including n-gram extraction, filtering based on regular expressions, or keyword extraction with the aid of a user-chosen reference corpus. This feature can be used in combination with an EEC subcorpus, which allows the user to generate very specific frequency lists. Some examples of frequency lists that could be useful to generate from the EEC are: nouns specific to Energy Engineering academic texts using the British National Corpus as a reference; the most common 4-grams in Zoology texts; adjectives containing -friendly in the whole EEC; or the most common verbs in Geology texts (Figure 10).

Figure 10: Frequency list of verbs in Geology texts.
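Two of these settings, n-gram extraction and regular-expression filtering, are easy to illustrate on a plain token list. The Python sketch below is a toy version under obvious assumptions: whitespace tokenization, no POS tags, hypothetical function names and an invented sample sentence. Keyword extraction against a reference corpus is not covered.

```python
import re
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matching_words(tokens, pattern):
    """Count tokens that fully match a regular expression."""
    regex = re.compile(pattern)
    return Counter(t for t in tokens if regex.fullmatch(t))


# Invented sample, tokenized on whitespace.
tokens = ("eco-friendly products and climate-friendly policies "
          "make eco-friendly products and services").split()

bigrams = ngram_counts(tokens, 2)
friendly = matching_words(tokens, r".*-friendly")
```

Restricting the token list to one subcorpus before counting would correspond to combining the word list with text-type metadata.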

4 Conclusion

In this paper, we have shown how the EEC was built and compiled and how it can be queried and exploited in Sketch Engine. The EEC's metadata, the default sketch grammar and the ESSG make the EEC a useful resource for any user interested in environmental science. As future work, we will refine, improve and update the ESSG and develop new rules for Spanish. Furthermore, in the short term, we plan to upload an improved version of the EEC (with more words and some minor codification issues solved) and a first version of the Spanish counterpart. In the long run, we will enhance the EEC with a new annotated version, where different semantic tags will be added to improve its querying potential. These semantic tags will include semantic categories and argument structure.


Sketch Engine's API also allows for the exploitation of the EEC from external applications. An example of this is EcoLexiCAT, a terminology-enhanced computer-assisted translation (CAT) tool that provides easy access to domain-specific terminological knowledge in context (León-Araúz & Reimerink 2018; León-Araúz, Reimerink & Faber 2017). EcoLexiCAT integrates different features of the professional translation workflow in a stand-alone interface where a source text is interactively enriched with terminological information (i.e., definitions, translations, images, compound terms, corpus access, etc.) from EcoLexicon, BabelNet, IATE, and Sketch Engine. In the Sketch Engine module of EcoLexiCAT's interface, terms from both the source and target segments can be selected and direct access is given to concordances, CQL queries and word sketches of the selected terms. For a more detailed analysis, the output of the queries can be opened in a new tab that sends users to the website of the Sketch Engine Open Corpora.

References

Faber, P., León-Araúz, P. & Reimerink, A. (2016). EcoLexicon: New Features and Challenges. In GLOBALEX 2016: Lexicographic Resources for Human Language Technology in conjunction with LREC 2016, pp. 73–80.

Jakubíček, M., Kilgarriff, A., McCarthy, D. & Rychlý, P. (2010). Fast syntactic searching in very large corpora for many languages. Proceedings of the PACLIC 24, pp. 741–747.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P. & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), pp. 7–36. http://doi.org/10.1007/s40607-014-0009-9

Kilgarriff, A., Kovář, V., Krek, S., Srdanovic, I. & Tiberius, C. (2010). A Quantitative Evaluation of Word Sketches. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the 14th EURALEX International Congress, pp. 372–379. Leeuwarden/Ljouwert, The Netherlands: Fryske Akademy.

Kosem, I., Gantar, P., Logar, N. & Krek, S. (2014). Automation of Lexicographic Work Using General and Specialized Corpora: Two Case Studies. In A. Abel, C. Vettori & N. Ralli (Eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus, pp. 355–364. Bolzano/Bozen: Institute for Specialised Communication and Multilingualism.

León-Araúz, P. & Reimerink, A. (2018). Evaluating EcoLexiCAT: a Terminology-Enhanced CAT Tool. In Proceedings of the 11th International Language Resources and Evaluation Conference (LREC2018). Miyazaki: ELRA.

León-Araúz, P., Reimerink, A. & Faber, P. (2017). EcoLexiCAT: a Terminology-enhanced Translation Tool for Texts on the Environment. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek & V. Baisa (Eds.), Electronic lexicography in the 21st century. Proceedings of eLex 2017 conference, pp. 321–341. Leiden: Lexical Computing.

León-Araúz, P. & San Martín, A. (2018). The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches. In I. Kernerman & S. Krek (Eds.), Proceedings of the LREC 2018 Workshop "Globalex 2018 – Lexicography & WordNets", pp. 94–99. Miyazaki: Globalex.

León-Araúz, P., San Martín, A. & Faber, P. (2016). Pattern-based Word Sketches for the Extraction of Semantic Relations. In Proceedings of the 5th International Workshop on Computational Terminology, pp. 73–82. Osaka.

Meyer, I. (2001). Extracting knowledge-rich contexts for terminography. In D. Bourigault, C. Jacquemin & M.-C. L'Homme (Eds.), Recent advances in computational terminology, pp. 279–302. Amsterdam/Philadelphia: John Benjamins.

San Martín, A., Cabezas-García, M., Buendía Castro, M., Sánchez Cárdenas, B., León-Araúz, P. & Faber, P. (2017). Recent Advances in EcoLexicon. Dictionaries: Journal of the Dictionary Society of North America, 38(1), pp. 96–115.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), pp. 1–13. http://doi.org/10.1075/ijcl.10.1.01teu


Acknowledgements

This research was carried out as part of projects FF2014-52740-P, Cognitive and Neurological Bases for Terminology-enhanced Translation (CONTENT), and FFI2017-89127-P, Translation-oriented Terminology Tools for Environmental Texts (TOTEM), funded by the Spanish Ministry of Economy and Competitiveness.

Poster presented at the 18th EURALEX International Congress: The EcoLexicon English Corpus as an Open Corpus in Sketch Engine

Data

July 2018

Pilar León Araúz · Antonio San Martín · Arianne Reimerink

Download

Big Data, Computer, and Technology in Language Studies: The Potentials of Sketch Engine in Indonesia’s Research

Conference Paper

Sep 2023

Explaining ambiguity in scientific language

Article

Full-text available

Aug 2022
SYNTHESE

Beckett Sterner

The idea that ambiguity can be productive in data science remains controversial. Efforts to make scientific publications and data intelligible to computers generally assume that accommodating multiple meanings for words, known as polysemy, undermines reasoning and communication. This assumption has nonetheless been contested by historians, philosophers, and social scientists, who have applied qualitative research methods to demonstrate the generative and strategic value of polysemy. Recent quantitative results from linguistics have also shown how polysemy can actually improve the efficiency of human communication. I present a new conceptual typology based on a synthesis of prior research about the aims, norms, and circumstances under which polysemy arises and is evaluated. The typology supports a contextual pluralist view of polysemy’s value for scientific research practices: polysemy does both substantial positive and negative work in science, but its utility is context-sensitive in ways that are often overlooked by the norms people have formulated to regulate its use, including prior scholars researching polysemy. I also propose that historical patterns in the use of partial synonyms, i.e. terms with overlapping meanings, provide an especially promising phenomenon for integrative research addressing these issues.

Phraseology and Culture in Terminological Knowledge Bases: The Case of Pollution and Environmental Law

Article

Full-text available

Feb 2024

Despite its importance, environmental law has largely been ignored in environmental knowledge bases. This may be due to the fact that legal issues may not, strictly speaking, be considered scientific knowledge in environmental knowledge resources, which may in turn relate to the complexity of reflecting the cultural component (which includes different legal systems) in the description of terms and concepts. The terminological knowledge base EcoLexicon has recently begun to include information on environmental law. This paper takes the methodological perspective of frame-based terminology to analyze typical verb collocations in environmental law that will be added to the phraseology module of EcoLexicon. Corpus analysis was used to compare the behavior of verbs collocating with pollution in environmental science and environmental law. Verbs were classified based on lexical domains and semantic classes through definition factorization, as described in the Lexical Grammar Model. The differences were mostly based on the specificity of the arguments and the emphasis on the polluter in environmental law. This resulted in a proposal for the inclusion and configuration of environmental law phraseology in EcoLexicon, showing sociocultural differences across environmental subdomains.

Stance in climate science: A diachronic analysis of epistemic stance features in IPCC physical science reports

Article

Full-text available

Jan 2023

This diachronic corpus-based analysis investigates the use of epistemic stance devices inreports from the United Nations’ Intergovernmental Panel on Climate Change (IPCC)from the date of its first report on the physical science of climate change in 1990 to itssixth contribution in 2021. Applying the framework of stance (Biber & Finegan, 1989),the study focuses upon changes in the use of the epistemic stance markers of modal verbsand adverbs across the approximately 30-year period. To empirically measure thestrength of trends in the use of these stance devices, Kendall’s Tau correlation coefficientwas calculated for each item using their normalized frequencies from the six reports.Analysis displayed that the use of modal verbs has consistently decreased across thisperiod of time in which the scientific consensus regarding the anthropogenic origins ofclimate change expanded and solidified. Additionally, of the greater than twenty stanceadverbs displaying consistent use trajectories across the period, the majority of theseitems were emphatic adverbs declining in use.

From specialized knowledge frames to linguistically based ontologies

Article

Full-text available

Mar 2024
Appl Ontol

This paper explains conceptual modeling within the framework of Frame-Based Terminology (Faber, 2012; 2015; 2022), as applied to EcoLexicon (ecolexicon.ugr.es), a specialized knowledge base on the environment (León-Araúz, Reimerink &, Faber, 2019; Faber & León-Araúz, 2021). It describes how a frame-based terminological resource is currently being restructured and reengineered as an initial step towards its formalization and subsequent transformation into an ontology. It also explains how the information in EcoLexicon can be integrated in environmental ontologies such as ENVO (Buttigieg, Morrison, Smith, Mungall & Lewis, 2013; Buttigieg, Pafilis, Lewis, Schildhauer, Walls & Mungall, 2016), particularly at the bottom tiers of the Ontology Learning Layer Cake (Cimiano, 2006; Cimiano, Maedche, Staab & Volker, 2009). The assumption is that frames, as a conceptual modeling tool, and information extracted from corpora can be used to represent the conceptual structure of a specialized domain.

Masculinity and Identity in Irish Literature: Heroes, Lads, and Fathers

Book

Jan 2024

Cassandra S. Tully de Lope

The Pragmatics of Multiword Terms: The Impact of Context

Book

Feb 2024

Melania Cabezas-García

This book explores the pragmatics of specialized language with a focus on multiword terms, complex phrases characterized by sequences of nouns or adjectives whose meaning is clarified in the unspecified but implicit links between them, with implications for their use and translation. The volume adopts an innovative approach rooted in Frame-Based Terminology which allows for the analysis of multiword – compound terms in specialized language, such as horizontal-axis wind turbine – term formation from an integrated semantic and pragmatic perspective. The book features data from a corpus on wind power in English, Spanish, and French comprising such specialized texts as research articles, books, reports, and PhD theses to consider term extraction and the identification of terminological correspondences. Cabezas-García highlights the ways in which pragmatic analysis is an integral part of understanding multiword terms, due to the necessary inference of information implicit within them, with applications for future research on pragmatics and specialized language more broadly. This book will be of interest to students and researchers in pragmatics, semantics, corpus linguistics, and terminology.

Machine versus corpus-based translation of multiword terms

Article

Full-text available

Jul 2023

Machine translation (MT) post-editing is an increasingly common practice in the translation industry which is also slowly being applied in the development of terminological resources. However, more studies have been devoted to analyze the practice in a translation scenario than in a terminographic context. Consequently, term-oriented post-editing guidelines are a current need if terminographers are also to become post-editors. With a view to enhancing the multilingual representation of environmental multiword terms (MWTs) in terminological resources, we analyze English-Spanish MWT translation in various generic MT systems. Our aims are: (1) to evaluate MT output in order to check whether it can be of any help to terminographers' work; (2) to develop an error typology in order to raise terminographers' awareness; and (3) to use the error typology to sketch a series of basic pre-editing and post-editing rules in a terminographic scenario. A comparison of MT output with the equivalents found in a comparable corpus is also presented. Even though MT often presents errors or unidiomatic choices, it can still serve as a basis for human post-editing, and provided that post-editors are familiarized with the potential errors. Comparable corpora, on the other hand, offer better results, but searches are more time-consuming and equivalents are not always available.

Berber lexicography: semantic and morphological problems

Article

Full-text available

Jan 2019

Carla Ferreros

Procedimiento para la traducción de términos poliléxicos con la ayuda de corpus

Chapter

Full-text available

Sep 2021

Multiword terms can be problematic when translating specialized texts. Their treatment involves correctly identifying and understanding them, as well as translating them into the target language. Since terminological resources do not always facilitate these tasks (due to different characteristics of multiword terms or the nature of these resources), translators must use other resources, such as corpora. However, this requires knowledge of corpus querying techniques, of which not all translators have a good command, and this often results in reluctance to use corpora (Bowker 2004; Gallego-Hernández 2015). With a view to facilitating multiword term management, this study describes a step-by-step protocol that allows translators to understand and translate these terms from English into Spanish using parallel and comparable corpora. 1. INTRODUCTION The treatment of multiword terms (e.g. UV-absorbing aerosol), which are the main phraseological units of specialized discourse, is one of the major stumbling blocks in any translation project.
This treatment generally consists of three phases, each involving difficulties of a different kind: identification (e.g. delimiting the multiword term), comprehension (e.g. structural and semantic disambiguation) and reproduction in the target language (TL) (e.g. searching for equivalents and discriminating between variants). The first step in addressing these issues is, logically, to consult terminological resources. However, these terms are not always included, or are included unsystematically. Moreover, as Bowker (2011) argues, translators no longer consult resources with the same "blind faith" as before, and many choose to use the proposals found there as a starting point for new searches in other resources. Translators must therefore master a range of techniques for solving the problems posed by multiword terms, making use of tools such as corpora. Traditionally, translators have resorted to parallel texts [1]
[1] Sánchez Gijón (2009) defines parallel texts as those that, in relation to the source text, provide information on textual conventions or the particularities of linguistic usage

The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches

Conference Paper

Full-text available

Jan 2018

Many projects have applied knowledge patterns (KPs) to the retrieval of specialized information. Yet terminologists still rely on manual analysis of concordance lines to extract semantic information, since there are no user-friendly, publicly available applications enabling them to find knowledge-rich contexts (KRCs). To fill this void, we have created the KP-based EcoLexicon Semantic Sketch Grammar (ESSG) in the well-known corpus query system Sketch Engine. For the first time, the ESSG is now publicly available in Sketch Engine to query the EcoLexicon English Corpus. Additionally, reusing the ESSG in any English corpus uploaded by the user enables Sketch Engine to extract KRCs codifying generic-specific, part-whole, location, cause and function relations, because most of the KPs are domain-independent. The information is displayed in the form of summary lists (word sketches) containing the pairs of terms linked by a given semantic relation. This paper describes the process of building a KP-based sketch grammar, with special focus on the last stage, namely, evaluation for refinement purposes. We conducted an initial shallow precision and recall evaluation of the 64 English sketch grammar rules created so far for hyponymy, meronymy and causality. Precision was measured based on a random sample of concordances extracted from each word sketch type. Recall was assessed based on a random sample of concordances where known term pairs are found. The results are necessary for the improvement and refinement of the ESSG. The noise of false positives helps to further specify the rules, whereas the silence of false negatives allows us to find useful new patterns.
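
As a rough illustration of the evaluation mentioned in this abstract, precision and recall over manually annotated concordance samples could be computed as below. This is a minimal sketch rather than the authors' procedure: the function name and the counts are hypothetical.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) from manually annotated counts.

    precision = correct hits / all hits returned by a rule (noise lowers it)
    recall    = correct hits / all valid instances in the sample (silence lowers it)
    """
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts for one word sketch type (not taken from the paper).
precision, recall = precision_recall(true_positives=80,
                                     false_positives=20,
                                     false_negatives=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here false positives correspond to the "noise" and false negatives to the "silence" referred to in the abstract.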

Recent Advances in EcoLexicon

Article

Full-text available

Oct 2017

EcoLexicon is a multilingual terminological knowledge base on the environment. It is the practical application of Frame-based Terminology, a cognitive approach to the representation of specialized knowledge. Recent enhancements include the EcoLexicon English corpus, a phraseological module, and a flexible approach to terminological definitions.

Pattern-based Word Sketches for the Extraction of Semantic Relations

Conference Paper

Full-text available

Dec 2016

Despite advances in computer technology, terminologists still tend to rely on manual work to extract all the semantic information that they need for the description of specialized concepts. In this paper we propose the creation of new word sketches in Sketch Engine for the extraction of semantic relations. Following a pattern-based approach, new sketch grammars are developed in order to extract some of the most common semantic relations used in the field of terminology: generic-specific, part-whole, location, cause and function.
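
To make the pattern-based idea in this abstract concrete, the sketch below matches a single generic-specific knowledge pattern ("X is a type of Y") with a plain regular expression. It is a deliberately simplified assumption: sketch grammars in Sketch Engine are written in CQL over part-of-speech-tagged text, and the pattern, function name and sample sentences here are hypothetical.

```python
import re

# One hypothetical generic-specific pattern:
# "<hyponym> is a/an type|kind of <hypernym>" (single-word terms only).
GENERIC_SPECIFIC = re.compile(
    r"\b(?P<hyponym>[A-Za-z-]+) is an? (?:type|kind) of (?P<hypernym>[A-Za-z-]+)"
)


def extract_generic_specific(text):
    """Return (hyponym, hypernym) pairs matched by the pattern, lowercased."""
    return [
        (m.group("hyponym").lower(), m.group("hypernym").lower())
        for m in GENERIC_SPECIFIC.finditer(text)
    ]


# Hypothetical sentences, not taken from the corpus.
sample = "Methane is a type of hydrocarbon. Basalt is a kind of rock."
for hyponym, hypernym in extract_generic_specific(sample):
    print(f"{hyponym} -> {hypernym}")
```

A surface regular expression like this one cannot handle multiword terms or intervening modifiers, which is one reason a grammar defined over tagged tokens is better suited to the task.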

EcoLexicon: New Features and Challenges

Conference Paper

Full-text available

May 2016

EcoLexicon is a terminological knowledge base (TKB) on the environment with terms in six languages: English, French, German, Modern Greek, Russian, and Spanish. It is the practical application of Frame-based Terminology, which uses a modified version of Fillmore's frames coupled with premises from Cognitive Linguistics to configure specialized domains on the basis of definitional templates and create situated representations for specialized knowledge concepts. The specification of the conceptual structure of (sub)events and the description of the lexical units are the result of a top-down and bottom-up approach that extracts information from a wide range of resources. This includes the use of corpora, the factorization of definitions from specialized resources and the extraction of conceptual relations with knowledge patterns. Similarly to a specialized visual thesaurus, EcoLexicon provides entries in the form of semantic networks that specify relations between environmental concepts. All entries are linked to a corresponding (sub)event and conceptual category. In other words, the structure of the conceptual, graphical, and linguistic information relative to entries is based on an underlying conceptual frame. Graphical information includes photos, images, and videos, whereas linguistic information specifies not only the grammatical category of each term, but also phraseological and contextual information. The TKB also provides access to the specialized corpus created for its development and a search engine to query it. One of the challenges for EcoLexicon in the near future is its inclusion in the Linguistic Linked Open Data Cloud.

The Sketch Engine: Ten Years On

Article

Full-text available

Jul 2014

The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.

Fast syntactic searching in very large corpora for many languages

Article

Full-text available

May 2011

For many linguistic investigations, the first step is to find examples. In the 21st century, they should all be found, not invented. Thus linguists need flexible tools for finding even quite rare phenomena. To support linguists well, they need to be fast even where corpora are very large and queries are complex. We present extensions to the CQL 'Corpus Query Language' for intuitive creation of syntactically rich queries, and demonstrate that they can be computed quickly within our tool even on multi-billion word corpora.

My version of corpus linguistics

Article

Full-text available

Mar 2005

Wolfgang Teubert

Extracting knowledge-rich contexts for terminography

Chapter

Jan 2001

Ingrid Meyer

Kilgarriff, A., Kovár, V., Krek, S., Srdanovic, I. & Tiberius, C. (2010). A Quantitative Evaluation of Word Sketches. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the 14th EURALEX International Congress, pp. 372-379. Leeuwarden/Ljouwert, The Netherlands: Fryske Akademy.

Kosem, I., Gantar, P., Logar, N. & Krek, S. (2014). Automation of Lexicographic Work Using General and Specialized Corpora: Two Case Studies. In A. Abel, C. Vettori & N. Ralli (Eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus, pp. 355-364. Bolzano/Bozen: Institute for Specialised Communication and Multilingualism.

