Figure 2: SVMs try to maximize the margin of separation between positive and negative examples.
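As a concrete illustration of the margin the figure depicts, the sketch below (illustrative only, not code from the paper) fits a linear SVM on toy 2-D data with scikit-learn and computes the geometric margin 2/||w||; the toy data and the large-C choice are assumptions.

```python
# Illustration of Figure 2's idea (not code from the paper): fit a
# linear SVM on toy 2-D data and compute the width of its margin.
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two Gaussian clusters.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([0] * 20 + [1] * 20)

# A large C approximates the hard-margin SVM, which picks the
# separating hyperplane with the widest margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear SVM the boundary is w.x + b = 0, and the distance
# between the two supporting hyperplanes is 2 / ||w||.
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
```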

Source publication
Conference Paper
In this paper, a method is presented to recognize multilingual Wikipedia named entity articles. This method classifies multilingual Wikipedia articles using a variety of structured and unstructured features and is aided by cross-language links and features in Wikipedia. Adding multilingual features helps boost classification accuracy and is shown t...
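The core idea of enriching an article's features with evidence from its cross-language counterparts could be sketched roughly as follows; every name and the toy data here are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of multilingual feature augmentation: extend an
# article's bag-of-features with prefixed features taken from the
# articles it reaches via cross-language links. All names and the toy
# data are illustrative assumptions.
from collections import Counter

def multilingual_features(article_feats, crosslang_feats):
    """Merge per-language feature counts into one feature vector.

    article_feats: Counter of features from the article itself.
    crosslang_feats: dict mapping language code -> Counter of features
        from the cross-language-linked article in that language.
    """
    merged = Counter(article_feats)
    for lang, feats in crosslang_feats.items():
        for feat, count in feats.items():
            merged[f"{lang}:{feat}"] += count  # prefix avoids collisions
    return merged

# Toy example: an English article about a person, with an Arabic
# cross-language link contributing extra evidence.
en = Counter({"infobox:person": 1, "word:born": 2})
xl = {"ar": Counter({"category:people": 1})}
print(multilingual_features(en, xl))
```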

Similar publications

Article
This article discusses the usefulness of geo-linguistic analysis for Internet studies by presenting two techniques to frame and visualize the linguistic development of the World Wide Web, in particular the geo-linguistic development amongst different language versions of Wikipedia. An emergent research agenda has been set to explore the multilingua...
Article
We present the design of a project to develop Wikipedia content on general vaccine safety and the COVID-19 vaccines, specifically. This proposal describes what a team would need to distribute public health information in Wikipedia in multiple languages in response to a disaster or crisis, and to measure and report the communication impact of the sa...
Preprint
Cross-lingual Entity Linking (XEL) aims to ground entity mentions written in any language to an English Knowledge Base (KB), such as Wikipedia. XEL for most languages is challenging, owing to limited availability of resources as supervision. We address this challenge by developing the first XEL approach that combines supervision from multiple langu...
Preprint
Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks. In this study, we explore the effectiveness of leveragin...
Preprint
Multilingual language models have been a crucial breakthrough as they considerably reduce the need of data for under-resourced languages. Nevertheless, the superiority of language-specific models has already been proven for languages having access to large amounts of data. In this work, we focus on Catalan with the aim to explore to what extent a m...

Citations

... Targeting non-English Wikipedia classifiers [21] and language-independent feature sets was the next evolution of this problem, aiming to use the same classifier for any Wikipedia language [22]. The problem then extended to fine-grained classification. ...
... Forward feature selection was adopted to reduce the set of features in order to fit with a small number of training examples. We started to combine the features in one ...

Comparison table embedded in the excerpt (column labels not recoverable):

[23]                              73%  61%  55%  36%
Saleh et al. [22]                 76%  64%  50%  33%
Tardif et al. [24]                60%  52%  35%  21%
Nothman et al. [7]                70%  58%  48%  33%
Dakka & Cucerzan [20] (Baseline)  30%  25%  16%  11%
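The forward feature selection the excerpt mentions can be outlined with scikit-learn's SequentialFeatureSelector, which greedily adds the feature whose inclusion most improves cross-validated performance; the dataset and base classifier below are placeholders, not the cited paper's setup.

```python
# Sketch of forward feature selection; dataset and classifier are
# placeholder assumptions.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time, keeping whichever addition most
# improves cross-validated accuracy; useful when a small number of
# training examples makes a large feature set unreliable.
selector = SequentialFeatureSelector(
    LinearSVC(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```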
... Dakka and Cucerzan [14] trained SVM and Naïve Bayes classifiers using page-based and context features, and their experimental results showed that structural features (such as the data in tables) are distinctive in identifying the NE type of Wikipedia articles. Saleh et al. [16] extracted features from the abstract, infobox, category, and persondata structures, and improved the recall of different NE types by using beta-gamma threshold adjustment. Tkatchenko et al. [17] adopted similar features to Tardif et al. [18]. ...
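The beta-gamma threshold adjustment attributed to Saleh et al. [16] is not reproduced here; the sketch below only illustrates the underlying idea that lowering a per-class decision threshold trades precision for recall. The data, classifier, and threshold values are all assumptions.

```python
# Illustration of decision-threshold adjustment (the general idea
# behind recall-oriented tuning; NOT the beta-gamma algorithm itself).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.8, 0.2],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]

for threshold in (0.5, 0.3):  # lowering the threshold boosts recall
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}:",
          f"precision={precision_score(yte, pred):.2f}",
          f"recall={recall_score(yte, pred):.2f}")
```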
... Then, a multi-class classifier was trained on the given features. All experiments were conducted with the SVM algorithm using the libSVM toolkit (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with linear kernels, which showed excellent performance in [14,16]. We evaluated the models using 5-fold cross-validation and adopted the widely used Precision, Recall, and F1 measures to assess classification performance. ...
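That evaluation protocol (linear-kernel SVM, 5-fold cross-validation, Precision/Recall/F1) maps directly onto scikit-learn, whose SVC wraps the same libSVM library; the 20-newsgroups subset below merely stands in for the Wikipedia articles.

```python
# Sketch of the excerpt's protocol: linear-kernel SVM (scikit-learn's
# SVC wraps libSVM), 5-fold cross-validation, macro-averaged
# Precision/Recall/F1. The 20-newsgroups subset (downloaded on first
# use) stands in for the Wikipedia data.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"])
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

scores = cross_validate(model, data.data, data.target, cv=5,
                        scoring=["precision_macro", "recall_macro",
                                 "f1_macro"])
for name in ("precision_macro", "recall_macro", "f1_macro"):
    print(name, round(scores[f"test_{name}"].mean(), 3))
```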
Article
Named entity classification of Wikipedia articles is a fundamental research area that can be used to automatically build large-scale corpora for named entity recognition or to support other entity-processing tasks, such as entity linking. This paper describes a method for classifying named entities in Chinese Wikipedia with fine-grained types. We considered multi-faceted information in Chinese Wikipedia to construct four feature sets, designed different feature selection methods for each feature set, and fused the different features into a single vector space using different strategies. Experimental results show that the explored feature sets and their combinations can effectively improve the performance of named entity classification.
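One common way to fuse several feature sets into a single vector space, in the spirit of this abstract, is feature concatenation; the sketch below uses scikit-learn's FeatureUnion with two placeholder extractors, which are assumptions rather than the paper's actual feature sets.

```python
# Sketch of fusing heterogeneous feature sets into one vector space
# via FeatureUnion; the two extractors and toy data are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Two views of the same documents, word unigrams and character
# n-grams, concatenated into a single sparse feature matrix.
features = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word")),
    ("chars", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])
model = make_pipeline(features, LinearSVC())

docs = ["Beijing is the capital of China",
        "Alan Turing was a mathematician"]
labels = ["LOC", "PER"]
model.fit(docs, labels)
print(model.predict(["Ada Lovelace wrote the first program"]))
```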
... Beyond newswire-based corpora, Wikipedia has become more attractive for different NLP tasks. Some researchers have exploited the unrestricted accessibility of Wikipedia to automatically build fully annotated NE corpora with different granularities, while others focus on partially utilising Wikipedia to achieve specific goals, such as developing an NE gazetteer (Attia et al., 2010) or classifying Wikipedia articles into NE semantic classes (Saleh et al., 2010). Tkatchenko et al. (2011) expanded the classification into an 18-class fine-grained taxonomy extracted from BBN. ...
... Several similar features have been selected (e.g. Saleh et al., 2010; Dakka and Cucerzan, 2008). ...
Conference Paper
This paper presents a methodology to exploit the potential of Arabic Wikipedia to assist in the automatic development of a large fine-grained Named Entity (NE) corpus and gazetteer. The cornerstone of this approach is the efficient classification of Wikipedia articles into target NE classes. The resources developed were thoroughly evaluated to ensure reliability and high quality. Results show the developed gazetteer boosts the performance of the NE classifier on a newswire domain by at least 2 points of F-measure. Moreover, by combining a learning-based NE classifier with the developed corpus, a high F-measure of 85.18% is achieved. The developed resources overcome the limitations of traditional Arabic NE tasks through more fine-grained analysis and provide a beneficial route for further studies.
... Naïve Bayes and the Support Vector Machine (SVM) were chosen as the statistical classifiers, exploiting a specific set of features such as bag-of-words, structured data, and unigram and bigram context. Recently, Saleh et al. (2010) proposed a similar approach to classifying multilingual Wikipedia articles into traditional NE classes. The assumption in that case was that most Wikipedia articles relate to a named entity. ...
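The feature types named in this excerpt (bag-of-words with unigram and bigram context) translate naturally into an n-gram vectorizer feeding Naïve Bayes and an SVM; the toy articles and labels below are illustrative assumptions.

```python
# Sketch of the excerpt's feature set: bag-of-words with unigrams and
# bigrams, classified by Naive Bayes and a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "born 1952 in Cairo he studied law",       # PERSON
    "the company was founded in 1998",         # ORGANIZATION
    "a city on the northern coast of Egypt",   # LOCATION
]
labels = ["PER", "ORG", "LOC"]

vec = CountVectorizer(ngram_range=(1, 2))  # unigram + bigram features
X = vec.fit_transform(docs)

for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__,
          clf.predict(vec.transform(["founded in 2005"])))
```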
Conference Paper
This paper describes a comprehensive set of experiments conducted in order to classify Arabic Wikipedia articles into predefined sets of Named Entity classes. We tackle the task using four different classifiers, namely Naïve Bayes, Multinomial Naïve Bayes, Support Vector Machines, and Stochastic Gradient Descent. We report on several aspects of the classification models, namely feature representation, feature sets, and statistical modelling. The results show that we are able to correctly classify the articles with scores of 90% for Precision, Recall, and balanced F-measure.
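For concreteness, the four classifiers this abstract names are all available in scikit-learn; the sketch below runs them on a toy text task, with the documents, labels, and parameters being assumptions rather than the paper's setup.

```python
# Sketch of the abstract's four classifiers on a toy text task; the
# documents, labels, and parameters are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC

docs = ["river delta in the north", "prime minister since 2004",
        "airline headquartered in Dubai", "mountain range in Asia"]
labels = ["LOC", "PER", "ORG", "LOC"]

# Dense count features so GaussianNB can run alongside the others.
X = CountVectorizer().fit_transform(docs).toarray()

for clf in (GaussianNB(), MultinomialNB(), LinearSVC(),
            SGDClassifier(random_state=0)):
    clf.fit(X, labels)
    print(type(clf).__name__, "train accuracy:", clf.score(X, labels))
```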
... Prior works on classifying Wikipedia articles [10,11,2,6,7,4,9,8] target named entity recognition (NER) [5] rather than suggesting infobox template types. The consequence is that they deal with only a very small number of classes (between 3 and 18), such as PERSON, ORGANIZATION, and LOCATION, which is also the classic setup in NER-related studies. ...
Conference Paper
Given the sheer amount of work and expertise required in authoring Wikipedia articles, automatic tools that help Wikipedia contributors generate and improve content are valuable. This paper presents our initial step towards building a full-fledged author assistant, particularly for suggesting infobox templates for articles. We build SVM classifiers to suggest infobox template types, among a large number of possible types, for Wikipedia articles without infoboxes. Unlike prior work on Wikipedia article classification, which deals with only a few label classes for named entity recognition, the much larger 337-class setup in our study is geared towards realistic deployment of an infobox suggestion tool. We also emphasize testing on articles without infoboxes, because labeled and unlabeled data exhibit different feature distributions, which departs from the typical assumption that they are drawn from the same underlying population.
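A linear SVM scales to such a large label set because scikit-learn's LinearSVC trains one one-vs-rest binary classifier per class; the sketch below shows the setup with a handful of template types standing in for the paper's 337 (the articles and labels are illustrative assumptions).

```python
# Sketch of multiclass infobox-template suggestion with a linear SVM;
# LinearSVC trains one one-vs-rest classifier per class, so the same
# setup scales to hundreds of template types. Toy data is assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

articles = [
    "is a software company based in",
    "is a freshwater lake located in",
    "is a 2010 studio album by",
    "served as a member of parliament",
]
templates = ["Infobox company", "Infobox lake",
             "Infobox album", "Infobox officeholder"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(articles, templates)

# Suggest a template for an article that has no infobox yet.
print(model.predict(["is a shallow lake in the west of"]))
```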
Conference Paper
Recognition of named entities (people, companies, locations, etc.) is an essential task of text analytics. We address a subproblem of this task, namely named entity classification. We propose a novel approach that constructs an effective fine-grained named entity classifier. Its key highlights are semi-automatic training-set construction from Wikipedia articles and additional feature selection. We justify our solution by creating an 18-class classifier and demonstrating its effectiveness and efficiency.
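Semi-automatic training-set construction from Wikipedia can be sketched as rule-based labeling from category strings; the rules, class inventory, and examples below are hypothetical illustrations, not the paper's method.

```python
# Hypothetical sketch of semi-automatic training-set construction:
# assign a coarse NE label to an article from its Wikipedia category
# strings via keyword rules. Rules and categories are assumptions.
RULES = {
    "PER": ("births", "deaths", "people"),
    "ORG": ("companies", "organizations", "clubs"),
    "LOC": ("cities", "countries", "rivers"),
}

def label_from_categories(categories):
    """Return the first NE class whose keywords match a category."""
    for cls, keywords in RULES.items():
        for cat in categories:
            if any(kw in cat.lower() for kw in keywords):
                return cls
    return None  # unmatched articles are left for manual annotation

print(label_from_categories(["1952 births", "Egyptian lawyers"]))  # PER
print(label_from_categories(["Rivers of Egypt"]))                  # LOC
```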