Xabier Saralegi's research while affiliated with Elhuyar Fundazioa and other places

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

Fig. 1 Comparison of correct answers (percentage) on the gdr-src+tgt test sets for Basque to Spanish translation using correct or incorrect context.

Fig. 4 Percentage of times each model answers correctly to one, two or none of the instances of the opposite pairs of the gdr-src+tgt test.

presents corpus statistics for EhuHAC, indicating the number of aligned sentences for different context sizes in Basque-Spanish (EU-ES) and Basque-French (EU-FR).

presents the statistics for the EiTB corpus, in terms of number of sentence pairs given the number of context sentences.

Corpora statistics in terms of number of sentences pairs.

+1

TANDO^+: Corpus and Baselines for Document-level Machine Translation in Basque-Spanish and Basque-French

October 2023

·

2 Reads

·

Thierry Etchegoyhen

·

·

[...]

·

Context-aware Neural Machine Translation can potentially enhance automated translation quality through effective modelling of context beyond the sentence level. However, suitable corpora for contextual modelling are still scarce, presenting a significant challenge for the training and evaluation of context-aware systems. To address this challenge, we describe \textsc{tando^+}, a document-level corpus for the under-resourced language pairs Basque-French and Basque-Spanish. We provide a detailed description of this corpus, which is to be shared with the scientific community. The corpus comprises parallel data from diverse domains (literature, subtitles, and news) and incorporates context-level information. Additionally, it provides manually crafted contrastive test sets for Basque-Spanish, designed for comprehensive assessment of gender and register contextual phenomena. Additionally, we train and evaluate sentence-level baseline models and several state-of-the-art contextual variants. Our results and analyses indicate that the corpus is well-suited to train and evaluate context-aware machine translation systems for the two selected under-resourced language pairs.

Scaling Laws for BERT in Low-Resource Settings

January 2023

·

16 Reads

·

Iñaki San Vicente

·

Xabier Saralegi

·

[...]

·

Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

Conference Paper

January 2023

·

2 Reads

·

1 Citation

·

Iñaki San Vicente

·

Xabier Saralegi

·

Information retrieval and question answering: A case study on COVID-19 scientific literature

December 2021

·

84 Reads

·

14 Citations

Knowledge-Based Systems

·

Iñaki San Vicente

·

Xabier Saralegi

·

[...]

·

Biosanitary experts around the world are directing their efforts towards the study of COVID-19. This effort generates a large volume of scientific publications at a speed that makes the effective acquisition of new knowledge difficult. Therefore, Information Systems are needed to assist biosanitary experts in accessing, consulting and analyzing these publications. In this work we develop a study of the variables involved in the development of a Question Answering system that receives a set of questions asked by experts about the disease COVID-19 and its causal virus SARS-CoV-2, and provides a ranked list of expert-level answers to each question. In particular, we address the interrelation of the Information Retrieval and the Answer Extraction steps. We found that a recall based document retrieval that leaves to a neural answer extraction module the scanning of the whole documents to find the best answer is a better strategy than relying in a precise passage retrieval before extracting the answer span.

Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels

March 2021

·

81 Reads

·

1 Citation

Xabier Saralegi

·

Iñaki San Vicente

This work analyzes the feasibility of training a neural retrieval system for a collection of scientific papers about COVID-19 using pseudo-qrels extracted from the collection. We propose a method for generating pseudo-qrels that exploits two characteristics present in scientific articles: a) the relationship between title and abstract, and b) the relationship between articles through sentences containing citations. Through these signals we generate pseudo-queries and their respective pseudo-positive (relevant documents) and pseudo-negative (non-relevant documents) examples. The article retrieval process combines a ranking model based on term-maching techniques and a neural one based on pretrained BERT models. BERT models are fine-tuned to the task using the pseudo-qrels generated. We compare different BERT models, both open domain and biomedical domain, and also the generated pseudo-qrels with the open domain MS-Marco dataset for fine-tuning the models. The results obtained on the TREC-COVID collection show that pseudo-qrels provide a significant improvement to neural models, both against classic IR baselines based on term-matching and neural systems trained on MS-Marco.

Basque NER results on EIEC corpus.

Give your Text Representation Models some Love: the Case for Basque

March 2020

·

143 Reads

·

Iñaki San Vicente

·

Jon Ander Campos

·

[...]

·

Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

Figure 1: Diagram of the modules composing Talaia platform architecture.)

Figure 2: Main page of the visualization interface for the monitorization of political campaign in the Basque elections of 2016. Upper left bar chart shows popularity rankings of the political parties. To its right, the time chart provides the evolution of the mentions taking the span of the whole campaign. Pie charts show the distribution of positive and negative mentions among the political parties. On the lower side of the figure a timeline of the most recent captured mentions is showed, classified by polarity.)

Talaia: a Real time Monitor of Social Media and Digital Press

September 2018

·

186 Reads

Iñaki San Vicente

·

Xabier Saralegi

·

Talaia is a platform for monitoring social media and digital press. A configurable crawler gathers content with respect to user defined domains or topics. Crawled data is processed by means of IXA-pipes NLP chain and EliXa sentiment analysis system. A Django powered interface provides data visualization to provide the user analysis of the data. This paper presents the architecture of the system and describes in detail the different components of the system. To prove the validity of the approach, two real use cases are accounted for, one in the cultural domain and one in the political domain. Evaluation for the sentiment analysis task in both scenarios is also provided, showing the capacity for domain adaptation.

the official results for the first 4 systems in the task.

EliXa: A Modular and Flexible ABSA Platform

February 2017

·

79 Reads

·

4 Citations

Iñaki San Vicente

·

Xabier Saralegi

·

This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform which allows to easily conduct experiments by replacing the modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polarity classification (slot 3) is addressed by means of a multiclass SVM algorithm which includes lexical based features such as the polarity values obtained from domain and open polarity lexicons. The system obtains accuracies of 0.70 and 0.73 for the restaurant and laptop domain respectively, and performs second best in the out-of-domain hotel, achieving an accuracy of 0.80.

TASS: Detecting Sentiments in Spanish Tweets

August 2015

·

118 Reads

·

6 Citations

Xabier Saralegi

·

Iñaki San Vicente

EliXa: A modular and flexible ABSA platform

June 2015

·

169 Reads

·

54 Citations

Iñaki San Vicente

·

Xabier Saralegi

·

This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform which allows to easily conduct experiments by replacing the modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polarity classification (slot 3) is addressed by means of a multiclass SVM algorithm which includes lexical based features such as the polarity values obtained from domain and open polarity lexicons. The system obtains accuracies of 0.70 and 0.73 for the restaurant and laptop domain respectively, and performs second best in the out-of-domain hotel, achieving an accuracy of 0.80.

... In general, the performance of a model tends to improve with an increase in data volume [25][26][27] . This is because larger-scale data provides more information, helping the model better learn features and patterns while reducing the risk of overfitting. ...
Reference:
CPMI-ChatGLM: parameter-efficient fine-tuning ChatGLM with Chinese patent medicine instructions

Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

Citing Conference Paper
January 2023

·

Iñaki San Vicente

·

Xabier Saralegi

·

... Information is the new currency in today's world. Extracting knowledge from the data is a tedious task, so there is a need to build a system that can extract information/knowledge from the given user input for different applications such as COVID-19 [1] and agriculture [2]. So, a question-answering system (QA) can be developed that can take user queries as well as a piece of text i.e., a paragraph in natural language and can provide relevant answers from the given paragraph. ...
Reference:
Improving the BERT model for long text sequences in question answering domain

Information retrieval and question answering: A case study on COVID-19 scientific literature

Citing Article
Full-text available
December 2021

Knowledge-Based Systems

·

Iñaki San Vicente

·

Xabier Saralegi

·

[...]

·

... As mentioned in Section 3.1 our IR module follows a two step approach [51]: ...
Reference:
Information retrieval and question answering: A case study on COVID-19 scientific literature

Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels

Citing Chapter
Full-text available
March 2021

Xabier Saralegi

·

Iñaki San Vicente

... Providing additional features in the form of POS tags does also improve the model's performance. While we still per- form not quite as good as the current state-of-the-art system EliXa [25] we do perform well with respect to the overall ranking of the SemEval2015 task as can be seen in Pontiki et al. [22]. In spite of being a few percentage points below the current state- of-the-art system EliXa, our proposed method still constitutes a meaningful contribution. ...
Reference:
Aspect-Based Relational Sentiment Analysis Using a Stacked Neural Network Architecture

EliXa: A Modular and Flexible ABSA Platform

Citing Article
Full-text available
February 2017

Iñaki San Vicente

·

Xabier Saralegi

·

... The other strategy, manual validation, means manually validating the top n candidate terms. This is usually done by either a domain-expert or a terminologist Drouin, 2003;Frantzi & Ananiadou, 1999;Gurrutxaga et al., 2013;Haque et al., 2018). The two most important drawbacks of this method are that it only evaluates precision, not recall, and that the annotated data is not easily reusable. ...
Reference:
D-TERMINE : data-driven term extraction methodologies investigated

Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

Citing Chapter
December 2013

Antton Gurrutxaga

·

·

Xabier Saralegi

·

Iñaki San Vicente

... Sentiment analysis was performed with the R extension package, "Sentiment Analysis" [66]. For the English reviews, the package's base dictionary was used; for Spanish, the ElhPolar dictionary [67] and for Portuguese, the SentiLex-PT 02 dictionary [68]. The analysis was performed by sentences (taking advantage of the text annotation from the previous procedure), applying the "ruleSentimentPolarity" method, which assigns a value between −1 and 1 to each review, being −1, very negative and 1, very positive. ...
Reference:
Exploring User-Generated Content for Improving Destination Knowledge: The Case of Two World Heritage Cities

Elhuyar at TASS 2013

Citing Conference Paper
Full-text available
September 2013

Xabier Saralegi

·

Iñaki San Vicente

... Examples are a YouTube corpus consisting of English and Italian comments (Uryupina et al., 2014), a not publicly available German Amazon review corpus of 270 sentences (Boland et al., 2013), in addition to the USAGE corpus (Klinger and Cimiano, 2014) we have used in this work, consisting of German and English reviews. The (non-fine-grained annotated) Spanish TASS corpus consists of Twitter messages (Saralegi and Vicente, 2012). The "Multilingual Subjectivity Analysis Gold Standard Data Set" focuses on subjectivity in the news domain (Balahur and Steinberger, 2009). ...
Reference:
Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis

TASS: Detecting Sentiments in Spanish Tweets

Citing Data
File available
August 2015

Xabier Saralegi

·

Iñaki San Vicente

... CRF models were found in Toh and Su (2015) and Brun et al. (2016). Vicente et al. (2017) and Wagner et al. (2014) employed SVM models. Further, for mining movie reviews through aspect-based sentiment analysis Manek et al. (2017) employed SVM classifier, andParkhe andBiswas (2016) utilized Naive Bayes method. ...
Reference:
Passenger intelligence as a competitive opportunity: unsupervised text analytics for discovering airline-specific insights from online reviews

EliXa: A modular and flexible ABSA platform

Citing Conference Paper
Full-text available
June 2015

Iñaki San Vicente

·

Xabier Saralegi

·

... Automated drafting of bilingual dictionary content may significantly ease the manual effort required to make dictionaries from scratch. As earlier experiments have shown, even for a relatively marginal language-pair like German-Basque, one can obtain equivalent candidates for around two thirds of the initial lemma list (Lindemann et al., 2014). But, in any case, it is not only the recall on the initial word lists that automated drafting methods may offer, but it is also, of course, the precision, that is, in our case, the adequacy of the draft equivalent pairs that makes the difference: for the production of a dictionary that deserves this name, as long as automated efforts continue to fail to achieve precision rates approaching 100%, manual editing of the draft data seems indispensable. ...
Reference:
Bilingual Dictionary Drafting: Bootstrapping WordNet and BabelNet

Bilingual Dictionary Drafting. The Example of German-Basque, a Medium-density Language Pair

Citing Conference Paper
Full-text available
July 2014

David Lindemann

·

·

·

[...]

·

Xabier Saralegi

... There are no proceedings that describe the features of each system, there are, only some technical notes that can be found on the TASS2012 3 web site. The system presented by Saralegi and San Vicente (2012) achieved the best results. It used a sequential minimal optimization (SMO) implementation of the support vector machine algorithm using Weka (Frank et al. 2016) and a polarity lexicon that was constructed from automatically translated English Lexicons and words extracted from a training corpus. ...
Reference:
Spanish sentiment analysis in Twitter at the TASS workshop

TASS: Detecting Sentiments in Spanish Tweets

Citing Article
Full-text available

Xabier Saralegi

·

Iñaki San Vicente

·

Vicente Elhuyar Fundazioa