Xabier Saralegi's research while affiliated with Elhuyar Fundazioa and other places


Publications (35)


Sentimenduen analisirako lexikoen sorkuntza (Creation of lexicons for sentiment analysis)
  • Conference Paper
  • Full-text available

May 2015 · 117 Reads · Xabier Saralegi

Polarity lexicons are a basic resource for automatically analysing the sentiments and opinions expressed in texts. Very little work has been done in this regard for Basque. This paper explores three methods to automatically construct polarity lexicons: translating existing lexicons from other languages, extracting polarity lexicons from corpora, and annotating sentiments in WordNet-like multilingual lexical knowledge bases. Results show that these methods are useful for creating effective polarity lexicons from scratch, quickly and with little effort from human experts.

1 Introduction and motivation

The motivation for opinion mining and sentiment analysis comes from the need for applications that analyse commercial and political domains. The goal of such applications would be to track society's sentiments and attitudes automatically, through news, forums and so on. What is society's opinion about the conflict in Ukraine? How do people receive a given brand? And after a specific model has been released? A system that identifies opinions and emotions from texts would be very useful for answering such questions.

The field of sentiment analysis has received an enormous boost in recent years, since it is of great interest for many activities: technology watch, marketing (to learn what people think of products and companies), analysing the reputation of individuals, detecting reactions to controversial topics, and so on. The recent growth of this line of research is linked to the arrival of Web 2.0. The new Internet has given users the ability to create content. Until now, society's opinion about a product, organisation or topic has been collected through surveys and customer services, but that required a direct relationship between the user and the company. As users, however, we often do not use those channels; it is far more common to express our opinions among friends. Until recently it was very hard for companies to obtain that information, but today's Internet stores it and makes it available to anyone. Opinion mining ... information from that enormous mass of data ...


Fig. 1: Aligned data of 3 WordNets. 
Bilingual Dictionary Drafting. The Example of German-Basque, a Medium-density Language Pair

July 2014 · 156 Reads · 6 Citations · Rogelio Nazar · [...] · Xabier Saralegi

This paper presents a set of Bilingual Dictionary Drafting (BDD) methods, including manual extraction from existing lexical databases and corpus-based NLP tools, and evaluates them on the German-Basque language pair. Our aim is twofold: to support a German-Basque bilingual dictionary project by providing draft bilingual glossaries, and to give lexicographers insight into how useful BDD methods are. Results show that the analysed methods can greatly assist bilingual dictionary writing in the context of medium-density language pairs.


Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

December 2013 · 30 Reads · 3 Citations

In this article we describe two tools we have built: one for compiling comparable corpora from the Internet and another for extracting bilingual terminology from comparable corpora. We also describe an evaluation of both: bilingual terminology was extracted from automatically collected domain-specific comparable web corpora in Basque and English, and the resulting terminology lists were validated automatically against a specialized dictionary to assess their quality. This evaluation thus measures how useful the two automatic tools are when combined in a real-world task, namely specialized dictionary making.


Elhuyar at TASS 2013

September 2013 · 220 Reads · 25 Citations

This article describes the system presented for the sentiment analysis task in the TASS 2013 evaluation campaign. We adopted a supervised approach that includes some linguistic knowledge-based processing for preparing the features. The processing comprises lemmatisation, POS tagging, tagging of polarity words, treatment of emoticons and treatment of negation. A pre-processing step for handling spelling errors is also performed. Polarity words are detected using a polarity lexicon built in two ways: projecting an English lexicon to Spanish, and extracting words that diverge between the positive and negative tweets of the training corpus. The system achieves 60% accuracy for fine-grained and 68% accuracy for coarse-grained polarity detection.
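The feature preparation steps named in this abstract (tagging polarity words, treating emoticons, treating negation) can be illustrated with a minimal sketch. It is not the system's actual implementation: the tiny lexicon, the emoticon table, the negator list and the rule that a negator flips only the next polarity word are all assumptions made for illustration.

```python
# Hypothetical toy resources; the real lexicon was projected from English
# and extracted from the training corpus, as the abstract explains.
POLARITY_LEXICON = {"bueno": 1, "genial": 1, "malo": -1, "horrible": -1}
EMOTICONS = {":)": 1, ":-)": 1, ":(": -1, ":-(": -1}
NEGATORS = {"no", "nunca", "ni"}


def polarity_features(tweet):
    """Count positive and negative cues in a tweet.

    Assumed negation rule: a negator flips the polarity of the next
    lexicon word only. Emoticons are never flipped.
    """
    counts = {"pos": 0, "neg": 0}
    negated = False
    for token in tweet.lower().split():
        if token in EMOTICONS:
            score = EMOTICONS[token]
        else:
            word = token.strip(".,;:!?¡¿")
            if word in NEGATORS:
                negated = True
                continue
            score = POLARITY_LEXICON.get(word, 0)
            if score and negated:
                score = -score
                negated = False
        if score > 0:
            counts["pos"] += 1
        elif score < 0:
            counts["neg"] += 1
    return counts


negated_example = polarity_features("No es bueno :(")
positive_example = polarity_features("Genial, muy bueno :)")
```

In a supervised setup such counts would be only two of many features passed to the classifier, alongside lemmas and POS tags.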


Cross-Lingual Projections vs. Corpora Extracted Subjectivity Lexicons for Less-Resourced Languages

March 2013 · 34 Reads · 6 Citations

Subjectivity tagging is a prior step for sentiment annotation. Both machine learning-based and linguistic knowledge-based approaches profit from using subjectivity lexicons. However, such resources are often available only for English or other major languages. This work analyses two strategies for building subjectivity lexicons automatically: projecting existing subjectivity lexicons from English to a new language, and building subjectivity lexicons from corpora. We evaluate which of the strategies performs best for building a subjectivity lexicon for a less-resourced language (Basque). The lexicons are evaluated extrinsically by classifying subjective and objective text units from various domains, at document or sentence level. A manual intrinsic evaluation is also provided, which consists of checking the correctness of the words included in the created lexicons.


Table 2. Selected test words from SemCor and their sense distribution
Table 5. Pearson correlation of sense distribution with respect to SemCor
Analyzing the Sense Distribution of Concordances Obtained by Web as Corpus Approach

In corpus-based lexicography and natural language processing, some authors have proposed using the Internet as a source of corpora for obtaining concordances of words. Most techniques that follow this method rely on web search engines oriented to information retrieval. However, these search engines do not rank concordances according to linguistic criteria but according to topic similarity or navigation-oriented criteria, such as PageRank. It follows that the examples or concordances may not be linguistically representative, and so the linguistic knowledge mined by these methods might not be very useful. This work analyzes the linguistic representativeness of concordances obtained from web search engines based on different relevance criteria (web, blog and news search engines). The analysis compares web concordances with SemCor (the reference) with regard to the distribution of word senses. Results showed that sense distributions in concordances obtained from web search engines are, in general, quite different from those in the reference corpus. The engines found to be most similar to the reference were the information-oriented ones (news and blog search engines).
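The comparison of sense distributions can be sketched as follows. One of the table captions on this page mentions Pearson correlation with respect to SemCor, so the sketch uses that measure; the sense labels and counts below are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def sense_distribution(sense_labels, inventory):
    """Relative frequency of each sense in a list of sense-tagged concordances."""
    total = len(sense_labels)
    return [sense_labels.count(sense) / total for sense in inventory]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sense inventory and manually tagged concordances for one word.
inventory = ["sense1", "sense2", "sense3"]
reference = ["sense1"] * 6 + ["sense2"] * 3 + ["sense3"] * 1   # e.g. SemCor
web = ["sense1"] * 2 + ["sense2"] * 3 + ["sense3"] * 5         # e.g. web search

ref_dist = sense_distribution(reference, inventory)
web_dist = sense_distribution(web, inventory)
correlation = pearson(ref_dist, web_dist)
```

A correlation near 1 would indicate that the web concordances mirror the reference sense distribution; in this invented example the dominant senses differ, so the value is negative.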


Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

January 2013 · 190 Reads · 1 Citation



Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

January 2011 · 217 Reads · 23 Citations

An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less-resourced languages. We studied whether these two techniques complement each other, given that they are based on different paradigms. We also investigated how to combine them, looking for the combination best suited to various application scenarios.
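The pivot inference and the polysemy problem described above can be illustrated with a small sketch. The dictionaries are toy data, and the pruning rule (keep a candidate only if at least two distinct pivot words support it) is a simplified stand-in for a structure-based filter, assumed here for illustration; it is not necessarily the method the paper evaluates.

```python
from collections import defaultdict


def infer_by_pivot(dict_ab, dict_bc):
    """Merge A-B and B-C dictionaries; return {a: {c: set of supporting pivots}}."""
    candidates = defaultdict(lambda: defaultdict(set))
    for a, pivots in dict_ab.items():
        for b in pivots:
            for c in dict_bc.get(b, ()):
                candidates[a][c].add(b)
    return candidates


def prune(candidates, min_pivots=2):
    """Keep only candidates reached through at least `min_pivots` pivot words."""
    return {
        a: sorted(c for c, pivots in cs.items() if len(pivots) >= min_pivots)
        for a, cs in candidates.items()
    }


# Toy Spanish-English and English-French dictionaries (illustrative only).
dict_ab = {"orilla": ["bank", "shore"]}
dict_bc = {"bank": ["banque", "rive"], "shore": ["rive", "rivage"]}

candidates = infer_by_pivot(dict_ab, dict_bc)
pruned = prune(candidates)
```

Here the polysemous pivot "bank" introduces the wrong candidate "banque", which the filter removes. It also discards the correct "rivage", which shows why pruning by structure alone trades recall for precision.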


Fig. 1. Example of a document expansion (doc_id: jrc31998D0293-en.xml, p_id: 17).
Table 1. Free parameters described in Section 2.4. λ is not used in run 2.
Table 2. Results for submitted runs
Table 3. Comparison between the two runs per language pair
Number of questions answered correctly in the monolingual run alone, in the cross-lingual run alone, and in both runs
Document Expansion for Cross-Lingual Passage Retrieval

October 2010 · 83 Reads · 2 Citations

This article describes the participation of the joint Elhuyar-IXA group in the ResPubliQA exercise at QA&CLEF 2010. In particular, we participated in the English–English monolingual task and in the Basque–English cross-lingual one. Our focus was threefold: (1) to check to what extent information retrieval (IR) can achieve good results in passage retrieval without question analysis and answer validation, (2) to check dictionary techniques for Basque-to-English retrieval, given the lack of parallel corpora for Basque in this domain, and (3) to check the contribution of WordNet-based semantic relatedness when expanding the passages with related words. Our results show that IR provides good results in the monolingual task, that the performance drop in the cross-lingual system was much greater than in previous CLIR experiments, and that expansion improves the results in the monolingual task.

Keywords: Cross-lingual passage retrieval, semantic relatedness, word cooccurrences.
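The idea of expanding passages with related words can be sketched as below. This is a hedged toy version: the `RELATED` table stands in for WordNet-based relatedness, and the scoring scheme with a reduced `expansion_weight` for matches on expanded terms is an assumption made for illustration, not the group's actual retrieval model.

```python
# Hypothetical relatedness table standing in for WordNet-based relatedness.
RELATED = {
    "tax": ["duty", "levy"],
    "treaty": ["agreement", "convention"],
}


def expand(passage_tokens, related=RELATED):
    """Return related words that are not already present in the passage."""
    expansion = []
    for token in passage_tokens:
        for candidate in related.get(token, ()):
            if candidate not in passage_tokens and candidate not in expansion:
                expansion.append(candidate)
    return expansion


def score(query_tokens, passage_tokens, expansion_weight=0.5):
    """Overlap score: 1.0 per query term found in the passage itself,
    `expansion_weight` per query term found only in the expansion."""
    expansion = expand(passage_tokens)
    total = 0.0
    for term in set(query_tokens):
        if term in passage_tokens:
            total += 1.0
        elif term in expansion:
            total += expansion_weight
    return total


passage = ["the", "tax", "on", "import"]
query = ["import", "duty"]
with_expansion = score(query, passage)
without_expansion = score(query, passage, expansion_weight=0.0)
```

The query term "duty" does not occur in the passage but is reached through the expansion of "tax", so the expanded passage scores higher than the unexpanded one.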


Citations (27)


... In general, the performance of a model tends to improve with an increase in data volume [25][26][27] . This is because larger-scale data provides more information, helping the model better learn features and patterns while reducing the risk of overfitting. ...

Reference:

CPMI-ChatGLM: parameter-efficient fine-tuning ChatGLM with Chinese patent medicine instructions
Not Enough Data to Pre-train Your Language Model? MT to the Rescue!
  • Citing Conference Paper
  • January 2023

... Information is the new currency in today's world. Extracting knowledge from the data is a tedious task, so there is a need to build a system that can extract information/knowledge from the given user input for different applications such as COVID-19 [1] and agriculture [2]. So, a question-answering system (QA) can be developed that can take user queries as well as a piece of text i.e., a paragraph in natural language and can provide relevant answers from the given paragraph. ...

Information retrieval and question answering: A case study on COVID-19 scientific literature

Knowledge-Based Systems

... Providing additional features in the form of POS tags does also improve the model's performance. While we still perform not quite as good as the current state-of-the-art system EliXa [25] we do perform well with respect to the overall ranking of the SemEval2015 task as can be seen in Pontiki et al. [22]. In spite of being a few percentage points below the current state-of-the-art system EliXa, our proposed method still constitutes a meaningful contribution. ...

EliXa: A Modular and Flexible ABSA Platform

... The other strategy, manual validation, means manually validating the top n candidate terms. This is usually done by either a domain-expert or a terminologist (Drouin, 2003; Frantzi & Ananiadou, 1999; Gurrutxaga et al., 2013; Haque et al., 2018). The two most important drawbacks of this method are that it only evaluates precision, not recall, and that the annotated data is not easily reusable. ...

Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making
  • Citing Chapter
  • December 2013

... Sentiment analysis was performed with the R extension package, "Sentiment Analysis" [66]. For the English reviews, the package's base dictionary was used; for Spanish, the ElhPolar dictionary [67] and for Portuguese, the SentiLex-PT 02 dictionary [68]. The analysis was performed by sentences (taking advantage of the text annotation from the previous procedure), applying the "ruleSentimentPolarity" method, which assigns a value between −1 and 1 to each review, being −1, very negative and 1, very positive. ...

Elhuyar at TASS 2013

... Examples are a YouTube corpus consisting of English and Italian comments (Uryupina et al., 2014), a not publicly available German Amazon review corpus of 270 sentences (Boland et al., 2013), in addition to the USAGE corpus (Klinger and Cimiano, 2014) we have used in this work, consisting of German and English reviews. The (non-fine-grained annotated) Spanish TASS corpus consists of Twitter messages (Saralegi and Vicente, 2012). The "Multilingual Subjectivity Analysis Gold Standard Data Set" focuses on subjectivity in the news domain (Balahur and Steinberger, 2009). ...

TASS: Detecting Sentiments in Spanish Tweets

... CRF models were found in Toh and Su (2015) and Brun et al. (2016). Vicente et al. (2017) and Wagner et al. (2014) employed SVM models. Further, for mining movie reviews through aspect-based sentiment analysis Manek et al. (2017) employed SVM classifier, and Parkhe and Biswas (2016) utilized Naive Bayes method. ...

EliXa: A modular and flexible ABSA platform

... Automated drafting of bilingual dictionary content may significantly ease the manual effort required to make dictionaries from scratch. As earlier experiments have shown, even for a relatively marginal language-pair like German-Basque, one can obtain equivalent candidates for around two thirds of the initial lemma list (Lindemann et al., 2014). But, in any case, it is not only the recall on the initial word lists that automated drafting methods may offer, but it is also, of course, the precision, that is, in our case, the adequacy of the draft equivalent pairs that makes the difference: for the production of a dictionary that deserves this name, as long as automated efforts continue to fail to achieve precision rates approaching 100%, manual editing of the draft data seems indispensable. ...

Bilingual Dictionary Drafting. The Example of German-Basque, a Medium-density Language Pair

... There are no proceedings that describe the features of each system; there are only some technical notes that can be found on the TASS2012 web site. The system presented by Saralegi and San Vicente (2012) achieved the best results. It used a sequential minimal optimization (SMO) implementation of the support vector machine algorithm using Weka (Frank et al. 2016) and a polarity lexicon that was constructed from automatically translated English lexicons and words extracted from a training corpus. ...

TASS: Detecting Sentiments in Spanish Tweets