Figure 5 - uploaded by Carmen Galvez
Master graph for Personal Names.

Source publication
Article
Full-text available
This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statist...

Context in source publication

Context 1
... this procedure, not only do we achieve the semiautomatic construction of the grammar that will generate combinations of the thousands of possible variant instances stored in the matrices but we also automate the construction of the FST in charge of recognizing them and conflating them into canonical forms. The extensive finite-state graph elaborated for the automatic construction of this LG is presented in Figure 5. ...
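To make the idea of "recognizing variants and conflating them into canonical forms" concrete, here is a minimal sketch of a hand-written finite-state transducer over name tokens. It is not the graph shown in Figure 5: the states, the token classes, the restriction to "Surname, Given" order, and the canonical form "Surname, I." are all assumptions chosen for illustration.

```python
import re

# Transition table: (state, token class) -> (next state, output action).
# States, token classes and actions are hypothetical, chosen for this sketch.
TRANSITIONS = {
    ("start", "WORD"): ("surname", "surname"),
    ("surname", "WORD"): ("surname", "surname"),  # compound surnames
    ("surname", "COMMA"): ("comma", None),
    ("comma", "WORD"): ("given", "initial"),
    ("comma", "INIT"): ("given", "initial"),
    ("given", "WORD"): ("given", "initial"),
    ("given", "INIT"): ("given", "initial"),
}
ACCEPTING = {"given"}


def classify(token):
    """Map a token to one of the classes used in the transition table."""
    if token == ",":
        return "COMMA"
    return "INIT" if len(token) == 1 else "WORD"


def conflate(name):
    """Return the assumed canonical form 'Surname, I.' or None if rejected."""
    # Tokens are runs of letters or a comma; periods and hyphens are dropped.
    tokens = re.findall(r"[^\W\d_]+|,", name)
    state = "start"
    surname, initials = [], []
    for tok in tokens:
        step = TRANSITIONS.get((state, classify(tok)))
        if step is None:
            return None  # no transition defined: the variant is not recognized
        state, action = step
        if action == "surname":
            surname.append(tok.capitalize())
        elif action == "initial":
            initials.append(tok[0].upper())
    if state not in ACCEPTING:
        return None
    return " ".join(surname) + ", " + "".join(i + "." for i in initials)


print(conflate("Galvez, Carmen"))
print(conflate("GALVEZ, C."))
```

Both calls produce the same string, which is the point of conflation: distinct surface variants are mapped to one canonical form, while inputs that fit no path through the table are rejected.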

Similar publications

Article
Full-text available
Collaboration among scholars and institutions is progressively becoming essential to the success of research grant procurement and to allow the emergence and evolution of scientific disciplines. Our work focuses on analysing if the volume of collaborations of one author together with the relevance of his collaborators is somewhat related to his res...
Article
Full-text available
Online bibliographic databases are providing significant resources to conduct analysis of academic social networks. We believe that work of an author is always influenced by work of his or her co-authors. In this study, we investigate the impact of productivity and quality of work of an author's co-authors on his or her ranking along with his own c...
Article
Full-text available
Analyzes the research activities of the Government Medical College & Hospital (GMCH), Chandigarh, as reflected in its 16 years (1992-2007) of 754 publications output covered in Scopus international multidisciplinary bibliographical database. Focuses on publication growth characteristics, format and media of communication, research impact and qualit...
Article
Full-text available
This paper analyzes the pollution control research output of the BRIC countries from 2006 to 2015. A total of 8,395 records were extracted from the Scopus international multidisciplinary bibliographic database. The data are analyzed by year-wise growth of publications, document type, country collaboration, language-wise publications, citations and t...
Article
Full-text available
Background: Identification of children with specific language impairment (SLI) has been viewed as both necessity and challenge. Investigators and clinicians use different tests and measures for this purpose. Some of these tests/measures have good psychometric properties, but it is not sufficient for diagnostic purposes. A diagnostic procedure can be...

Citations

... In contrast, learnable similarity metrics can be trained using labeled samples of co-referent and non-co-referent names. Similarity metrics can be acquired through machine learning classifiers trained on features derived from pairwise string comparisons, encompassing static similarity scores [3,16], and probabilistic finite-state transducers, which estimate the probability of specific variations within the context of adjacent characters [17,18,19]. Training classifiers with multiple static similarity metrics can unveil nonlinear relationships between similarity scores and matching probabilities, offering a data-driven approach to weighing various definitions of string similarity. ...
Article
Full-text available
Name matching plays a crucial role in big data and various integration applications, being indispensable when consolidating information from diverse sources. This encompasses tasks such as deduplication, data linkage systems, search engines, text and web mining, information extraction, and more. Discrepancies and anomalies in names, including syntax variations like abbreviations, typographical errors, occasional whitespace omissions, word insertions, deletions, and even multiple spellings for the same name, can lead to missed matches. In previous methodologies, a predefined penalty scheme was often employed for each differing character or multi-character token between two strings. This research introduces Name2Vec, an algorithm that addresses name matching using a neural network model to capture name semantics. This approach advances by suggesting a suitable feature set through the fusion of Name2Vec and character-based name representations. The empirical findings of this research confirm that this performance enhancement improves matching efficiency while simultaneously reducing misclassifications compared to state-of-the-art methods.
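The citing context above mentions classifiers trained on features derived from pairwise string comparisons. As a rough sketch of what such features might look like, the code below computes two static similarity scores for a pair of names. The choice of features (normalized edit similarity and character-bigram Jaccard overlap) and the function names are assumptions for illustration, not the feature set of Name2Vec or of the cited works.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def pair_features(a, b):
    """Hypothetical feature dictionary for one candidate name pair."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return {
        "edit_sim": 1 - levenshtein(a, b) / longest,
        "bigram_jaccard": len(ga & gb) / len(union) if union else 1.0,
        "exact": float(a == b),
    }


print(pair_features("Galvez", "Galves"))
```

Feature dictionaries like this one, computed for labeled matching and non-matching pairs, could then be passed to any standard classifier; that training step is left out here.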
... A widely used system intended solely for harvesting metadata is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 43. The OAI-PMH protocol is based on HTTP, where request arguments are issued as GET or POST parameters of a URL. There are six request types, called "verbs," such as ListRecords for creating and fetching a list of records. ...
... It can also be used to disambiguate authors with similar or identical names. Automatic tools for creating authority records include clustering [41] and other name matching algorithms such as [19,43], but even with these methods, human interaction is often required. ...
Article
Epistolary data about historical letters is typically distributed in different archives depending on where the letters were sent to and received, and the data are represented using local heterogeneous data models and different natural languages. To study such letter data on a global level, the heterogeneous, distributed data in local siloes need to be aggregated and harmonized into larger services where local metadata can enrich each other to complement missing information. This paper presents a new framework, LetterSampo, for representing, publishing, and using epistolary data as Linked Open Data (LOD) on the Web for Digital Humanities (DH) research. The framework is used for creating LOD services and for building individual LetterSampo portals on top of them. To test and demonstrate the framework, it has been applied to the epistolary CKCC dataset of ca. 20000 letters of the Huygens Institute, the Netherlands, to the correspSearch dataset of ca. 151000 letters aggregated by the Berlin-Brandenburg Academy of Sciences and Humanities, and to the Early Modern Letters Online (EMLO) data of ca. 170000 letters published by the University of Oxford. The CKCC and correspSearch datasets were published as LOD services, SPARQL endpoints, and as data dumps at Zenodo.org for re-use, and a demonstrational portal LetterSampo: Historical Letters on the Semantic Web was created based on this data. A novelty of the LetterSampo portals is to use faceted semantic search for filtering data of interest in flexible ways from multiple perspectives on two conceptual levels, and then visualize and analyze the results and data by seamlessly integrated data analytic tools—programming skills are not needed for using the portals. In addition to using the tools of the portal, the SPARQL endpoints can be used with modest knowledge about programming for DH research.
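The OAI-PMH request pattern described in the citing context above (a verb plus arguments passed as URL parameters) can be sketched as follows. The base URL `https://example.org/oai` and the helper name `build_request_url` are hypothetical; the six verb names and the `metadataPrefix=oai_dc` argument come from the OAI-PMH 2.0 specification. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_request_url(base_url, verb, **arguments):
    """Build a GET request URL for a repository (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb and its arguments are sent as ordinary query parameters.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, used only for illustration.
url = build_request_url("https://example.org/oai", "ListRecords",
                        metadataPrefix="oai_dc")
print(url)
```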
... Similarly, papers written just before a change of research area will hardly ever be self-cited. Galvez and Anegón [41] addressed the problem of conflating personal name variants into a standard or canonical form, exploiting finite-state transducers and binary matrices. They divide the variants into valid (variation among legitimate variants and canonical forms, e.g., the lack of some components of a full name, the absence or use of punctuation marks, and the use of initials) and non-valid (variation among non-legitimate variants and correct forms, e.g., misspellings involving deletions or insertions of characters in the strings, nicknames, abbreviations, and errors of accentuation in names from certain languages) categories. ...
Article
Full-text available
Entity resolution is a challenging and hot research area in the field of Information Systems for the last decade. Author name disambiguation in bibliographic databases like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the author name disambiguation task is to find which citations belong to the same author. In this survey, we start with three basic author name disambiguation problems, followed by a need for solutions and challenges. A generic, five-step framework is provided for handling author name disambiguation issues. These steps are preparation of dataset, selection of publication attributes, selection of similarity metrics, selection of models, and performance evaluation of clustering. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.
... Similarly, papers written just before a change of research area will hardly ever be self-cited. Galvez and Anegón [38] addressed the problem of conflating personal name variants into a standard or canonical form, exploiting finite-state transducers and binary matrices. They divide the variants into valid (variation among legitimate variants and canonical forms, e.g., the lack of some components of a full name, the absence or use of punctuation marks, the use of initials, etc.) and non-valid (variation among non-legitimate variants and correct forms, e.g., misspellings involving deletions or insertions of characters in the strings, nicknames, abbreviations, errors of accentuation in names from certain languages, etc.) categories. ...
Preprint
Full-text available
Entity resolution has been a challenging and hot research area in the field of Information Systems over the last decade. Author Name Disambiguation (AND) in Bibliographic Databases (BD) like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the AND task is to find which citations belong to the same author. In this survey, we start with three basic AND problems, followed by the need for solutions and the challenges. A generic, five-step framework is provided for handling AND issues. These steps are: (1) preparation of the dataset, (2) selection of publication attributes, (3) selection of similarity metrics, (4) selection of models, and (5) clustering performance evaluation. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.
... In contrast, learnable similarity metrics can be trained using labeled examples of co-referent and non-co-referent names. Learnable similarity metrics encompass the use of machine learning classifiers trained on features extracted from pairwise string comparisons (such as static similarity scores) (12, 13), as well as probabilistic finite-state transducers, which estimate the probability of specific variations in the context of surrounding characters (14-16). Training classifiers on multiple static similarity metrics can reveal non-linear relationships between similarity scores and matching probabilities and provide a means of empirically weighting various definitions of string similarity. Finite-state transducers, meanwhile, can learn penalties associated with specific character deletions, insertions, and substitutions in context, but are consequently much more complex, and require extensive training and validation data to achieve good performance. ...
Preprint
Full-text available
Approximate string-matching methods to account for complex variation in highly discriminatory text fields, such as personal names, can enhance probabilistic record linkage. However, discriminating between matching and non-matching strings is challenging for logographic scripts, where similarities in pronunciation, appearance, or keystroke sequence are not directly encoded in the string data. We leverage a large Chinese administrative dataset with known match status to develop logistic regression and Xgboost classifiers integrating measures of visual, phonetic, and keystroke similarity to enhance identification of potentially-matching name pairs. We evaluate three methods of leveraging name similarity scores in large-scale probabilistic record linkage, which can adapt to varying match prevalence and information in supporting fields: (1) setting a threshold score based on predicted quality of name-matching across all record pairs; (2) setting a threshold score based on predicted discriminatory power of the linkage model; and (3) using empirical score distributions among matches and nonmatches to perform Bayesian adjustment of matching probabilities estimated from exact-agreement linkage. In experiments on holdout data, as well as data simulated with varying name error rates and supporting fields, a logistic regression classifier incorporated via the Bayesian method demonstrated marked improvements over exact-agreement linkage with respect to discriminatory power, match probability estimation, and accuracy, reducing the total number of misclassified record pairs by 21% in test data and up to an average of 93% in simulated datasets. Our results demonstrate the value of incorporating visual, phonetic, and keystroke similarity for logographic name matching, as well as the promise of our Bayesian approach to leverage name-matching within large-scale record linkage.
... Like Gurney et al. [6], in our work, we consider this problem already solved. Although they too focus on these normalization problems, our work has many accordances with work by Galvez and Moya-Anegón [5], who also use a graph structure and rely on the notion of blocks being equivalence classes over some relation. Instead of using a hierarchical structure as we do, they define a finite-state machine that parses a name from left to right and conflates different standards. ...
Conference Paper
In this work, we address the problem of blocking in the context of author name disambiguation. We describe a framework that formalizes different ways of name-matching to determine which names could potentially refer to the same author. We focus on name variations that follow from specifying a name with different completeness (i.e. full first name or only initial). We extend this framework by a simple way to define traditional, new and custom blocking schemes. Then, we evaluate different old and new schemes in the Web of Science. In this context we define and compare a new type of blocking schemes. Based on these results, we discuss the question whether name-matching can be used in blocking evaluation as a replacement of annotated author identifiers. Finally, we argue that blocking can have a strong impact on the application and evaluation of author disambiguation.
... Contrasted with successive ones, finding them is particularly intriguing and huge. Hypothetically, it characterizes another sort of examples for uncommon occasion mining, which can portray customized and unusual practices for extraordinary clients [7]. ...
... Topic Detection and Tracking (TDT) task [3], [9] aimed to detect and track topics (events) in news streams with clustering-based techniques on keywords. Considering the co-occurrence of words and their semantic associations, a lot of probabilistic generative models for extracting topics from documents were also proposed, such as PLSI, LDA [7] and their extensions integrating different features of documents [5] as well as models for short texts like Twitter-LDA. In many real applications, document collections generally carry temporal information and can thus be considered as document streams. ...
... Our null hypothesis is straightforward: The alternative hypothesis is: The regression model is as follows: Y X where Y represents a clubbing effect. These three outcome variables are proportions bounded by 0 and 1 inclusive. For individual self-citations, we confine our analysis to primary authors due to well-known author ambiguity issues (Galvez & Moya-Anegon, 2007; Onodera et al., 2011). Independent variable. Our explanatory variable is China's heavy hitters (i.e., highly cited Chinese nanotechnology articles). ...
... Some string matching algorithms [9] are used for extracting variants or abbreviations of personal names. For instance, matching "Ram Kumar" with the first name initialized variant 'R. ...
Conference Paper
Full-text available
A person may have multiple personal name aliases on the web. Identifying aliases of a name is useful in information retrieval and knowledge management, sentiment analysis, relation extraction and name disambiguation. The objective of detecting aliases from the web is to retrieve all the information pertaining to a personal name whose content is described with different nicknames in different web documents. As of now, the web contains aliases of popular personalities in various domains such as sports, politics, medicine, music, and cinema, but does not contain alias information about ordinary people. Recently, methods of extracting aliases through lexical-pattern-based retrieval have been proven, tested using real-world name-alias pairs in Japanese and English as training data related to limited domains. In this paper, we discuss various personal name disambiguation methods used in web-related tasks.