Xabier Saralegi's research while affiliated with Elhuyar Fundazioa and other places

Publications (35)

Sentimenduen analisirako lexikoen sorkuntza (Creating Lexicons for Sentiment Analysis)
  • Conference Paper

May 2015


117 Reads


Xabier Saralegi

Polarity lexicons are a basic resource for automatically analyzing the sentiments and opinions expressed in texts. Very little work has been done in this regard for Basque. This paper explores three methods for constructing polarity lexicons automatically: translating existing lexicons from other languages, extracting polarity lexicons from corpora, and annotating sentiments in WordNet-like multilingual lexical knowledge bases. Results show that these methods are useful for creating effective polarity lexicons from scratch quickly and with little effort from human experts.


Fig. 1: Aligned data of 3 WordNets. 
Bilingual Dictionary Drafting. The Example of German-Basque, a Medium-density Language Pair

July 2014


156 Reads


6 Citations



Rogelio Nazar




Xabier Saralegi

This paper presents a set of Bilingual Dictionary Drafting (BDD) methods, including manual extraction from existing lexical databases and corpus-based NLP tools, and evaluates them on German-Basque as an example language pair. Our aim is twofold: to support a German-Basque bilingual dictionary project by providing draft bilingual glossaries, and to give lexicographers insight into how useful BDD methods are. Results show that the analysed methods can greatly assist in bilingual dictionary writing in the context of medium-density language pairs.

Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

December 2013


30 Reads


3 Citations

In this article we describe two tools we have built, one for compiling comparable corpora from the Internet and the other for extracting bilingual terminology from comparable corpora, together with an evaluation of both. Bilingual terminology was extracted from automatically collected domain-specific comparable web corpora in Basque and English, and the resulting terminology lists were validated automatically against a specialized dictionary to assess their quality. The evaluation thus measures how useful the two tools are when combined in a real-world task: specialized dictionary making.
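
The automatic validation step described in this abstract can be illustrated with a minimal Python sketch. It assumes the extracted terminology is a list of (source, target) pairs and the specialized dictionary is a mapping from source terms to sets of accepted translations; the function name, the data layout and the toy entries are assumptions made for illustration, not details taken from the article.

```python
def dictionary_precision(extracted, reference):
    """Validate extracted (source, target) pairs against a reference dictionary.

    Only pairs whose source term is listed in the reference can be judged
    automatically. Returns (precision over judged pairs, share of pairs judged).
    """
    judged = [(s, t) for s, t in extracted if s in reference]
    correct = [(s, t) for s, t in judged if t in reference[s]]
    precision = len(correct) / len(judged) if judged else 0.0
    coverage = len(judged) / len(extracted) if extracted else 0.0
    return precision, coverage


# Toy data, invented for illustration only.
reference = {"zelula": {"cell"}, "geruza": {"layer", "stratum"}}
extracted = [("zelula", "cell"), ("geruza", "coat"), ("mintz", "membrane")]

precision, coverage = dictionary_precision(extracted, reference)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

Pairs whose source term is missing from the dictionary are left unjudged rather than counted as errors, since a specialized dictionary is unlikely to list every valid term.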

Elhuyar at TASS 2013

September 2013


220 Reads


25 Citations

This article describes the system presented for the sentiment analysis task in the TASS 2013 evaluation campaign. We adopted a supervised approach that includes some linguistic knowledge-based processing for preparing the features. The processing comprises lemmatisation, POS tagging, tagging of polarity words, treatment of emoticons and treatment of negation. A pre-processing step for the treatment of spelling errors is also performed. Polarity words are detected using a polarity lexicon built in two ways: projecting an English lexicon to Spanish, and extracting words that diverge between the positive and negative tweets of the training corpus. The system achieves 60% accuracy for fine-grained and 68% accuracy for coarse-grained polarity detection.
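
As a rough illustration of how polarity-word tagging and negation treatment can be turned into classifier features, the Python sketch below counts lexicon hits and flips their polarity when a negation cue occurs shortly before them. The lexicon entries, the list of negation cues and the two-token window are all assumptions made for this example; the abstract does not specify how the submitted system handles these details.

```python
# Toy resources, invented for illustration only.
NEGATORS = {"no", "nunca", "ni"}
LEXICON = {"bueno": 1, "genial": 1, "malo": -1, "horrible": -1}


def polarity_features(tokens, window=2):
    """Count positive and negative lexicon hits in a token list.

    A hit within `window` tokens after a negation cue has its polarity flipped.
    """
    pos = neg = 0
    negate_left = 0  # tokens still inside the scope of the last negator
    for tok in tokens:
        tok = tok.lower()
        if tok in NEGATORS:
            negate_left = window
            continue
        score = LEXICON.get(tok, 0)
        if score and negate_left > 0:
            score = -score
        if score > 0:
            pos += 1
        elif score < 0:
            neg += 1
        negate_left = max(0, negate_left - 1)
    return {"pos": pos, "neg": neg}


print(polarity_features("no es bueno".split()))
print(polarity_features("es genial".split()))
```

In a supervised setting such counts would be one group of features among others (for example lemmas, POS tags and emoticons) passed to the classifier.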

Cross-Lingual Projections vs. Corpora Extracted Subjectivity Lexicons for Less-Resourced Languages

March 2013


34 Reads


6 Citations

Subjectivity tagging is a preliminary step for sentiment annotation. Both machine learning-based approaches and linguistic knowledge-based ones profit from using subjectivity lexicons. However, such resources are often available only for English or other major languages. This work analyses two strategies for building subjectivity lexicons automatically: projecting existing subjectivity lexicons from English to a new language, and building subjectivity lexicons from corpora. We evaluate which of the strategies performs best for the task of building a subjectivity lexicon for a less-resourced language (Basque). The lexicons are evaluated extrinsically by classifying subjective and objective text units belonging to various domains, at document or sentence level. A manual intrinsic evaluation is also provided, which consists of checking the correctness of the words included in the created lexicons.

Table 2. Selected test words from SemCor and their sense distribution
Table 5. Pearson correlation of sense distribution with respect to SemCor
Analyzing the Sense Distribution of Concordances Obtained by Web as Corpus Approach

In corpus-based lexicography and natural language processing, some authors have proposed using the Internet as a source of corpora for obtaining concordances of words. Most techniques implemented with this method rely on information retrieval-oriented web search engines. However, these search engines do not rank concordances according to linguistic criteria but according to topic similarity or navigation-oriented criteria such as PageRank. It follows that the examples or concordances may not be linguistically representative, and so the linguistic knowledge mined by these methods might not be very useful. This work analyzes the linguistic representativeness of concordances obtained from web search engines based on different relevance criteria (web, blog and news search engines). The analysis consists of comparing web concordances with SemCor (the reference) with regard to the distribution of word senses. Results showed that sense distributions in concordances obtained from web search engines are, in general, quite different from those obtained from the reference corpus. Among the search engines, the ones most similar to the reference were the information-oriented engines (news and blog search engines).
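
The comparison of sense distributions can be sketched in Python as follows, using the Pearson correlation that the table captions for this publication mention. The sense labels and counts are invented for illustration, and the helper names are assumptions; the sketch only shows the general shape of such a comparison, not the exact procedure of the study.

```python
from collections import Counter
from math import sqrt


def sense_distribution(sense_labels, senses):
    """Relative frequency of each sense in `senses` among the given labels."""
    counts = Counter(sense_labels)
    total = sum(counts[s] for s in senses)
    return [counts[s] / total if total else 0.0 for s in senses]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


# Hypothetical sense-annotated concordances for one test word.
senses = ["sense_1", "sense_2", "sense_3"]
reference_labels = ["sense_1"] * 6 + ["sense_2"] * 3 + ["sense_3"] * 1
web_labels = ["sense_1"] * 1 + ["sense_2"] * 3 + ["sense_3"] * 6

ref_dist = sense_distribution(reference_labels, senses)
web_dist = sense_distribution(web_labels, senses)
r = pearson(ref_dist, web_dist)
print(f"Pearson r = {r:.3f}")
```

A correlation close to 1 would indicate that the web concordances reproduce the reference sense distribution; a low or negative value, as in this toy case, indicates a skew towards different senses.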

Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

January 2013


190 Reads


1 Citation

Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

January 2011


217 Reads


23 Citations

An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less-resourced languages. We studied whether these two techniques complement each other, given that they are based on different paradigms. We also investigated how best to combine them for various application scenarios.
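
The pivot merge and the problem caused by polysemous pivots can be illustrated with a short Python sketch. The pruning rule used here, keeping a candidate only if it is reached through a large enough share of the source word's pivot translations, is a simple stand-in for a structure-based method and is an assumption of this example, not necessarily the scoring analyzed in the paper. The Spanish-English-French entries are toy data.

```python
from collections import defaultdict


def pivot_candidates(a_to_b, b_to_c):
    """Merge A-B and B-C dictionaries via pivot language B.

    Returns, for each A word, a mapping from candidate C words to the set of
    pivot words that link them.
    """
    candidates = defaultdict(lambda: defaultdict(set))
    for a, pivots in a_to_b.items():
        for b in pivots:
            for c in b_to_c.get(b, ()):
                candidates[a][c].add(b)
    return candidates


def prune_by_shared_pivots(candidates, a_to_b, min_ratio=0.75):
    """Keep candidates reached through at least `min_ratio` of the A word's pivots."""
    kept = {}
    for a, targets in candidates.items():
        n_pivots = len(a_to_b[a])
        kept[a] = {
            c for c, shared in targets.items()
            if len(shared) / n_pivots >= min_ratio
        }
    return kept


# Toy dictionaries (A = Spanish, B = English, C = French), for illustration only.
a_to_b = {"orilla": {"bank", "shore"}}
b_to_c = {"bank": {"banque", "rive"}, "shore": {"rive", "rivage"}}

candidates = pivot_candidates(a_to_b, b_to_c)
kept = prune_by_shared_pivots(candidates, a_to_b)
print(sorted(candidates["orilla"]))
print(kept)
```

In this toy case the polysemous pivot "bank" introduces the wrong candidate "banque", which the threshold removes. The correct candidate "rivage" is removed as well, which shows the precision/recall trade-off of this kind of pruning and why a second, corpus-based signal can be complementary.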

Fig. 1. Example of a document expansion (doc_id: jrc31998D0293-en.xml, p_id: 17).
Table 1. Free parameters described in Section 2.4. λ is not used in run 2.
Table 2. Results for submitted runs
Table 3. Comparison between the two runs per language pair
Number of questions answered correctly in the monolingual run alone, in the cross-lingual run alone, and in both runs
Document Expansion for Cross-Lingual Passage Retrieval

October 2010


83 Reads


2 Citations

This article describes the participation of the joint Elhuyar-IXA group in the ResPubliQA exercise at QA@CLEF 2010. In particular, we participated in the English–English monolingual task and in the Basque–English cross-lingual one. Our focus was threefold: (1) to check to what extent information retrieval (IR) can achieve good results in passage retrieval without question analysis and answer validation, (2) to check dictionary techniques for Basque-to-English retrieval given the lack of parallel corpora for Basque in this domain, and (3) to check the contribution of WordNet-based semantic relatedness when expanding passages with related words. Our results show that IR provides good results in the monolingual task, that the performance drop in the cross-lingual system was much greater than in previous CLIR experiments, and that expansion improves the results in the monolingual task.

Keywords: cross-lingual passage retrieval, semantic relatedness, word co-occurrences.
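
Passage expansion with related words can be sketched in Python as below. The relatedness scores are a hand-written toy table standing in for scores that would be derived from WordNet, and the `top_k` cut-off is an assumed parameter; neither is taken from the article.

```python
def expand_passage(tokens, related, top_k=2):
    """Append the top_k most related words of each token to the passage.

    `related` maps a word to a dict of {related_word: score}. Words already in
    the passage, or already added, are skipped.
    """
    seen = set(tokens)
    expansion = []
    for tok in tokens:
        ranked = sorted(related.get(tok, {}).items(), key=lambda kv: (-kv[1], kv[0]))
        for word, _score in ranked[:top_k]:
            if word not in seen:
                seen.add(word)
                expansion.append(word)
    return tokens + expansion


# Toy relatedness table, invented for illustration only.
related = {
    "tax": {"duty": 0.9, "levy": 0.8, "fee": 0.4},
    "import": {"duty": 0.7, "export": 0.6},
}

expanded = expand_passage(["import", "tax"], related)
print(expanded)
```

The expanded passage would then be indexed in place of, or alongside, the original one, so that a query containing a related word can still retrieve it.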

