Master graph for Personal Names.

Source publication

Approximate Personal Name-Matching Through Finite-State Graphs

Article

Nov 2007

This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statist...

Context 1

... this procedure, not only do we achieve the semiautomatic construction of the grammar that will generate combinations of the thousands of possible variant instances stored in the matrices but we also automate the construction of the FST in charge of recognizing them and conflating them into canonical forms. The extensive finite-state graph elaborated for the automatic construction of this LG is presented in Figure 5. ...

The Evolution of Your Success Lies at the Centre of Your Co-Authorship Network

Article

Mar 2015

Collaboration among scholars and institutions is progressively becoming essential to the success of research grant procurement and to allow the emergence and evolution of scientific disciplines. Our work focuses on analysing if the volume of collaborations of one author together with the relevance of his collaborators is somewhat related to his res...

Impact of mutual influence while ranking authors in a co-authorship network

Article

Jan 2016

Online bibliographic databases are providing significant resources to conduct analysis of academic social networks. We believe that work of an author is always influenced by work of his or her co-authors. In this study, we investigate the impact of productivity and quality of work of an author's co-authors on his or her ranking along with his own c...

Growth and impact of research output of Government Medical College & Hospital, Chandigarh: A case study

Article

Jul 2009

Analyzes the research activities of the Government Medical College & Hospital (GMCH), Chandigarh, as reflected in its output of 754 publications over 16 years (1992-2007) covered in the Scopus international multidisciplinary bibliographical database. Focuses on publication growth characteristics, format and media of communication, research impact and qualit...

Pollution Control Research Output in BRIC Countries during 2006-2015 from SCOPUS Database: A Scientometric Analysis

Article

Oct 2018

This paper analyses the pollution control research output in BRIC countries from 2006 to 2015. A total of 8395 records were extracted from the SCOPUS international multidisciplinary bibliographic database. The data are analyzed by year-wise growth of publications, document type, country collaboration, language-wise publications, citations and t...

A systematic review on diagnostic procedures for specific language impairment: The sensitivity and specificity issues

Article

Sep 2016

Background: Identification of children with specific language impairment (SLI) has been viewed as both a necessity and a challenge. Investigators and clinicians use different tests and measures for this purpose. Some of these tests/measures have good psychometric properties, but this is not sufficient for diagnostic purposes. A diagnostic procedure can be...

Name2Vec: Name Matching using Character-based with Deep Learning

Article

Jan 2023

Xuan Truong Dinh

Name matching plays a crucial role in big data and various integration applications, being indispensable when consolidating information from diverse sources. This encompasses tasks such as deduplication, data linkage systems, search engines, text and web mining, information extraction, and more. Discrepancies and anomalies in names, including syntax variations like abbreviations, typographical errors, occasional whitespace omissions, word insertions, deletions, and even multiple spellings for the same name, can lead to missed matches. In previous methodologies, a predefined penalty scheme was often employed for each differing character or multi-character token between two strings. This research introduces Name2Vec, an algorithm that addresses name matching using a neural network model to capture name semantics. The approach goes further by proposing a feature set that fuses Name2Vec and character-based name representations. The empirical findings indicate that this feature set improves matching efficiency while reducing misclassifications compared to state-of-the-art methods.

LetterSampo – Historical Letters on the Semantic Web: A Framework and Its Application to Publishing and Using Epistolary Data

Article

Nov 2022

Epistolary data about historical letters is typically distributed in different archives depending on where the letters were sent to and received, and the data are represented using local heterogeneous data models and different natural languages. To study such letter data on a global level, the heterogeneous, distributed data in local siloes need to be aggregated and harmonized into larger services where local metadata can enrich each other to complement missing information. This paper presents a new framework, LetterSampo, for representing, publishing, and using epistolary data as Linked Open Data (LOD) on the Web for Digital Humanities (DH) research. The framework is used for creating LOD services and for building individual LetterSampo portals on top of them. To test and demonstrate the framework, it has been applied to the epistolary CKCC dataset of ca. 20000 letters of the Huygens Institute, the Netherlands, to the correspSearch dataset of ca. 151000 letters aggregated by the Berlin-Brandenburg Academy of Sciences and Humanities, and to the Early Modern Letters Online (EMLO) data of ca. 170000 letters published by the University of Oxford. The CKCC and correspSearch datasets were published as LOD services, SPARQL endpoints, and as data dumps at Zenodo.org for re-use, and a demonstrational portal LetterSampo: Historical Letters on the Semantic Web was created based on this data. A novelty of the LetterSampo portals is to use faceted semantic search for filtering data of interest in flexible ways from multiple perspectives on two conceptual levels, and then visualize and analyze the results and data by seamlessly integrated data analytic tools—programming skills are not needed for using the portals. In addition to using the tools of the portal, the SPARQL endpoints can be used with modest knowledge about programming for DH research.

Author Name Disambiguation in Bibliographic Databases: A Survey

Article

Dec 2021

Entity resolution has been a challenging and active research area in the field of Information Systems for the last decade. Author name disambiguation in bibliographic databases like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the author name disambiguation task is to find which citations belong to the same author. In this survey, we start with three basic author name disambiguation problems, followed by the need for solutions and the challenges involved. A generic, five-step framework is provided for handling author name disambiguation issues. These steps are preparation of dataset, selection of publication attributes, selection of similarity metrics, selection of models, and performance evaluation of clustering. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.

Author Name Disambiguation in Bibliographic Databases: a Survey

Preprint

Apr 2020

Entity resolution has been a challenging and active research area in the field of Information Systems since the last decade. Author Name Disambiguation (AND) in Bibliographic Databases (BD) like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the AND task is to find which citations belong to the same author. In this survey, we start with three basic AND problems, followed by the need for solutions and the challenges involved. A generic, five-step framework is provided for handling AND issues. These steps are: (1) preparation of dataset, (2) selection of publication attributes, (3) selection of similarity metrics, (4) selection of models, and (5) clustering performance evaluation. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.

Machine-learning classifiers for logographic name matching in public health applications: approaches for incorporating phonetic, visual, and keystroke similarity in large-scale probabilistic record linkage

Preprint

Jan 2020

Approximate string-matching methods to account for complex variation in highly discriminatory text fields, such as personal names, can enhance probabilistic record linkage. However, discriminating between matching and non-matching strings is challenging for logographic scripts, where similarities in pronunciation, appearance, or keystroke sequence are not directly encoded in the string data. We leverage a large Chinese administrative dataset with known match status to develop logistic regression and XGBoost classifiers integrating measures of visual, phonetic, and keystroke similarity to enhance identification of potentially matching name pairs. We evaluate three methods of leveraging name similarity scores in large-scale probabilistic record linkage, which can adapt to varying match prevalence and information in supporting fields: (1) setting a threshold score based on predicted quality of name-matching across all record pairs; (2) setting a threshold score based on predicted discriminatory power of the linkage model; and (3) using empirical score distributions among matches and nonmatches to perform Bayesian adjustment of matching probabilities estimated from exact-agreement linkage. In experiments on holdout data, as well as data simulated with varying name error rates and supporting fields, a logistic regression classifier incorporated via the Bayesian method demonstrated marked improvements over exact-agreement linkage with respect to discriminatory power, match probability estimation, and accuracy, reducing the total number of misclassified record pairs by 21% in test data and up to an average of 93% in simulated datasets. Our results demonstrate the value of incorporating visual, phonetic, and keystroke similarity for logographic name matching, as well as the promise of our Bayesian approach to leverage name-matching within large-scale record linkage.
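
The third method in the abstract above, Bayesian adjustment of a match probability using score distributions among matches and nonmatches, can be illustrated with Bayes' rule in odds form. This is only a generic sketch, not the preprint's actual formulation, which the abstract does not spell out. The function name `bayes_adjust` and its three parameters are hypothetical: a prior match probability (for example, one estimated from exact-agreement linkage) and the likelihood of the observed name-similarity score among matches and among nonmatches.

```python
def bayes_adjust(prior_prob, p_score_given_match, p_score_given_nonmatch):
    """Update a prior match probability with a name-similarity score.

    Hypothetical helper: applies Bayes' rule in odds form,
    posterior odds = prior odds * likelihood ratio.
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if p_score_given_match < 0.0 or p_score_given_nonmatch <= 0.0:
        raise ValueError("likelihoods must be non-negative, nonmatch one positive")

    prior_odds = prior_prob / (1.0 - prior_prob)
    # How much more likely the observed score is among matches than nonmatches.
    likelihood_ratio = p_score_given_match / p_score_given_nonmatch
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)
```

A score that is equally likely among matches and nonmatches leaves the prior unchanged, while a score that is more typical of matches raises it.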

The Impact of Name-Matching and Blocking on Author Disambiguation

Conference Paper

Oct 2018

Tobias Backes

In this work, we address the problem of blocking in the context of author name disambiguation. We describe a framework that formalizes different ways of name-matching to determine which names could potentially refer to the same author. We focus on name variations that follow from specifying a name with different degrees of completeness (i.e. full first name or only initial). We extend this framework by a simple way to define traditional, new and custom blocking schemes. Then, we evaluate different old and new schemes in the Web of Science. In this context we define and compare a new type of blocking scheme. Based on these results, we discuss the question of whether name-matching can be used in blocking evaluation as a replacement of annotated author identifiers. Finally, we argue that blocking can have a strong impact on the application and evaluation of author disambiguation.
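
The idea of names given with different completeness (full first name or only an initial) and of a blocking scheme built on it can be made concrete with a toy example. This is a minimal sketch and not the paper's framework: the `Surname, Given` input format, the prefix-based compatibility rule, and the function names `split_name`, `compatible` and `block_by_surname_initial` are all assumptions made for illustration.

```python
from collections import defaultdict


def split_name(name):
    """Split 'Surname, Given' into (surname, first given token), lower-cased."""
    surname, _, given = name.partition(",")
    tokens = given.replace(".", " ").split()
    first = tokens[0].lower() if tokens else ""
    return surname.strip().lower(), first


def compatible(a, b):
    """Toy rule: same surname, and one first name is a prefix of the other."""
    surname_a, first_a = split_name(a)
    surname_b, first_b = split_name(b)
    if surname_a != surname_b:
        return False
    return first_a.startswith(first_b) or first_b.startswith(first_a)


def block_by_surname_initial(names):
    """Group names into blocks keyed by (surname, first initial)."""
    blocks = defaultdict(list)
    for name in names:
        surname, first = split_name(name)
        blocks[(surname, first[:1])].append(name)
    return dict(blocks)
```

Under these assumptions, "Backes, Tobias" and "Backes, T." are compatible, while "Backes, Tobias" and "Backes, Thomas" are not, even though the surname-plus-initial scheme puts all three in the same block. That gap between a block and the pairwise name-matching relation is the kind of difference a blocking evaluation has to account for.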

Efficient Way to Identify User Aware Rare Sequential Patterns in Document Streams

Article

Jun 2017

Study on Efficient Way to Identify User Aware Rare Sequential Pattern Matching in Document Stream

Article

Feb 2017

Swati V. Mengje

clubbing- jasist

Data

Dec 2015

Framework on Extracting Personal Name Pseudonyms from the Web

Conference Paper

Nov 2015

A person may have multiple personal name aliases on the web. Identifying aliases of a name is useful in information retrieval and knowledge management, sentiment analysis, relation extraction and name disambiguation. The objective of detecting aliases from the web is to retrieve all the information pertaining to a personal name whose content is described with different nicknames in different web documents. At present, the web contains aliases of popular personalities in various domains such as sports, politics, medicine, music and cinema, but does not contain alias information about ordinary people. Recently, proven methods for extracting aliases through lexical-pattern-based retrieval have been tested using real-world name-alias pairs in Japanese and English as training data from limited domains. In this paper, we discuss various personal name disambiguation methods used in web-related tasks.
