
Author Name Disambiguation in Bibliographic Databases: a Survey

April 2020


Authors:

Ali Daud

Rabdan Academy

Tehmina Amjad

Northeastern University


Volume 2, Issue 1, Article 9, Pages 87-110, Dec, 2021 87

Received: 1 May 2020; Revised: 8 Dec 2021; Accepted: 10 Dec 2021; Published: 26 Dec 2021

Journal Name [Online ISSN Coming Soon], Volume 2, Issue 1, Article 9, Pages 87-110, December 2021

Digital Object Identifier 10.1111/RpJC.2020.DOINumber

Author Name Disambiguation in Bibliographic Databases: A Survey

Muhammad Shoaib, Ali Daud, Tehmina Amjad

Department of Computer Science, Comsats University, Sahiwal Campus, Pakistan

Department of Information Systems and Technology, College of Computer Science and Engineering, University of

Jeddah, Saudi Arabia

Department of Computer Science and Software Engineering, International Islamic University, Islamabad, Pakistan

Corresponding author: Tehmina Amjad (tehminaamjad@iiu.edu.pk)

ABSTRACT Entity resolution has been a challenging and active research area in the field of Information Systems for the last decade. Author name disambiguation in bibliographic databases like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the author name disambiguation task is to find which citations belong to the same author. In this survey, we start with three basic author name disambiguation problems, followed by the need for solutions and the challenges involved. A generic, five-step framework is provided for handling author name disambiguation issues. These steps are preparation of the dataset, selection of publication attributes, selection of similarity metrics, selection of models, and performance evaluation of clustering. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.

Keywords Author Name Disambiguation, Bibliographic Databases, Entity Resolution, Metrics, Similarity.

I. INTRODUCTION

The scholarly societies that are constituted via bibliometric networks are growing with progress in scientific research [1–4]. Network science methods cover several aspects of the study of evolving sciences, such as the relationship between professions and their careers [5], finding emerging stars in scholarly networks [6–8], the study of citation networks [9–12], social media analytics [13–16], and expert ranking methods [17–19]. The problem of entity resolution has long attracted the attention of information system researchers. Author Name Disambiguation (AND) in Bibliographic Databases (BD) is an active issue and a specialized field of entity resolution. Author name disambiguation is the process of distinguishing authors with similar names from each other. Bibliographic databases include a large amount of data from co-author networks and digital libraries. Authors or researchers can have similar names, a single author can write his or her full name in multiple ways, and different authors can share multiple names. These situations create ambiguity for methods that need publication metadata for ranking or evaluating authors [20–24]. Disambiguation methods are required not only in co-author networks but are also significant in fields like spam filtering [25–27]. Search engines like Google help users search web pages automatically. Name queries make up approximately 5-10% of all queries [28]. Further, it is estimated that the 300 most common male names are used by more than 114 million people in the United States [29]. Search engines usually treat name queries as normal keyword searches and pay no special attention to their possible ambiguity. For example, a Google search for Tehmina Amjad returns 228,000 web pages containing similar names. Only a small portion of these pages is relevant to the intended Tehmina Amjad. This is because the data on the internet is heterogeneous.

http://www.informatik.uni-trier.de/~ley/db/

http://citeseer.ist.psu.edu/

http://www.scopus.com/home.url

http://www.google.com


In BD, it is necessary to uniquely identify the work of one researcher from another, and this process is known as AND.

Formally, a bibliographic database is an organized digital store of citations to research publications, patents, books,

and news articles. It stores the metadata of the publications. Examples of commonly used BD are DBLP [30], CiteSeer

[31], MEDLINE1, and Google Scholar2. An AND method that best fits a bibliographic dataset may not be suitable for

other datasets. The reason behind this is that they differ in their metadata schema. Most of the methods fall in either

supervised learning or unsupervised learning or a combination of the two.

Smalheiser and Torvik [32] have provided a detailed literature survey of methods for AND, but their work has several shortcomings, such as the lack of a general framework and the lack of a detailed, category-wise explanation of similarity metrics and methods. Sanyal et al. [33] present a comprehensive survey of the existing author name disambiguation (AND) approaches that have been applied to the PubMed database. The authors classify the approaches into a taxonomy and describe the key characteristics of each approach, such as its performance, strengths, and weaknesses. They also identify the PubMed datasets that are publicly available for researchers to evaluate AND algorithms.

Our contributions in this work are as follows:

(1) Proposal of a general framework for AND.

(2) Categorization and elaboration of similarity metrics, which are the main focus of researchers in AND for finding the resemblance among citations.

(3) Categorization of the methods used to handle the AND task into five types, with elaboration of the works falling under each category in chronological order.

The rest of the paper is organized as follows. Section 2 describes AND tasks and related concepts. Section 3 provides

a general framework based on most of the methods used in the past. Section 4 is about the commonly used datasets to

perform AND. Section 5 is about the similarity estimation metrics. Section 6 categorizes the methods employed for

AND and explains categories in chronological order. Section 7 explains how to compare different methods and some

future directions and recommendations are suggested in section 8. Finally, section 9 concludes this paper.

II. AUTHOR NAME DISAMBIGUATION IN BIBLIOGRAPHIC DATABASES (ANDBD)

Resolving the name ambiguity in Bibliographic Databases is called ANDBD. In literature many terms are used for this

problem like name disambiguation [34], [35], object distinction [36], mixed and split citation [37], author

disambiguation [38] and entity resolution [39], [40]. ANDBD problems can be divided into three categories. Before

discussing ANDBD problem categories through intuitive examples, some related basic concepts are provided.

Publication: A publication means the research work/article/paper of an author or group of authors working together

published at any venue (journal, conference, or workshop).

Citations: The number of times a publication is cited/referenced by other publications.

References: It is the list of references given at the end of a publication.

Ambiguous Author name(s): A name that is either shared by multiple authors or one of multiple variant names of a single author. Let A be the ambiguous author name shared by k unique authors, say, a1, a2, …, ak. Further, let ai be an author represented by m different names, say, n1, n2, …, nm. In this article, we use “ambiguous author name”, “ambiguous author” and “ambiguous name” interchangeably.

A. PROBLEM CATEGORIES

1) SYNONYMY/NAME VARIANT PROBLEM

The problem of Synonymy arises when an author has variations or abbreviations in his/her name in the citations. For

example, the author name “Malik Sikandar Hayat Khiyal” is also written as “Sikandar Hayat” in citations of the

publications. The DBLP treats them as two different authors and divides his publications between two names. In

literature, this problem is also referred to as name variant problem [40], [41], entity resolution problem [39], split

citation problem [37] and aliasing problem [42].

2) POLYSEMY/NAME SHARING PROBLEM

The problem of Polysemy arises when multiple authors share the same name label in multiple citations. For example,

“Guilin Chen” and “Guangyu Chen” write their names as “G. Chen” in their publications. A full name of an author

may be shared by multiple authors. Bibliographic databases may treat these different authors as a single author.

Resultantly, on querying the database for such ambiguous names, it may list all publications under a single person’s

name. For example, querying DBLP for the author name “Michael Johnson” lists 32 publications that are actually from five different people [40]. In the literature this problem goes by various names, such as name disambiguation [34], [35], [43], object distinction [36], mixed citation [37], author disambiguation [38], and the common name problem [40].

1 www.ncbi.nlm.nih.gov

2 scholar.google.com

3) NAME MIXING PROBLEM

Shu et al. [40] introduced another type of name disambiguation problem and referred to it as a name mixing problem.

If multiple persons share multiple names, it is called the name mixing problem. The two problems discussed above

may occur simultaneously and cause the name mixing problem.

Typographical mistakes also cause name ambiguity. Treeratpituk and Giles [42] consider typographical mistakes in names as a separate name disambiguation problem. These problems may arise from the use of abbreviations, from spelling mistakes, and occasionally from placing a caste or family name at the end or at the beginning of a name. L. Branting [44] has discussed nine different types of name variations.

B. NEED FOR THE SOLUTION

Name ambiguity may cause incorrect authorship identification in literary works resulting in improper credit attribution

to the authors. AND is a basic and compulsory step for performing bibliometric and scientometric analyses.

Disambiguating authors may help establish precise author profiles, co-author networks, and citation networks. In academic digital libraries, disambiguating author names is necessary for the following reasons.

• Users are interested in finding papers written by a particular researcher [45].

• Research communities and institutions can track the achievements of their researchers [46].

• It helps in expert finding, through which publishers can easily find paper reviewers [47].

C. CHALLENGES INVOLVED IN AND

Certain challenges are involved in AND, some of which are highlighted below.

• Lack of identifying information: The identifier metadata are either incomplete or not available at all.

• Multi-directional problem: Multi-disciplinary papers authored by multiple researchers from multiple institutions (nationwide or worldwide) may cause a ‘multiple entities disambiguation’ problem.

• Few papers by most of the authors: The machine learning techniques used for AND give better results when a reasonable number of examples are available. This is only possible when the individual authors have produced many papers. In MEDLINE almost 46% of the authors have written only one paper [48]. Authors with one or a few papers are a big hindrance to devising precise machine learning techniques.

• Heterogeneous nature of BD: The BD are heterogeneous in many ways, such as schema heterogeneity, discipline heterogeneity, language heterogeneity, and attribute heterogeneity.

• The non-serious attitude of the authors: Sometimes authors are reluctant to register with a universal identification system like UAI_Sys [49] or ORCID [50], or to maintain consolidated profiles.

• Economic issue: The construction of a database that can accommodate and manage the worldwide research community, including all disciplines, nations, and languages, is not only economically unfeasible but also probably impossible.

• Ownership issue: While testing an algorithm for AND, confirmation of the original author sometimes becomes doubtful.

D. IS A UNIQUE IDENTIFIER FOR AN AUTHOR A VIABLE SOLUTION?

One may think that unique identifiers, say, an Author Identification Number (AID), can be a simple and reliable solution to this problem. Dervos et al. [49] proposed UAI_Sys, in which authors can register themselves by entering their metadata information. UAI_Sys in return assigns a 16-digit unique code to the author. ORCID [50] is a similar attempt for the same purpose; it issues a 16-character alphanumeric code to uniquely identify a researcher. It offers a permanent identity for people, just like the ones issued for content-related entities on digital networks by digital object identifiers. Although this seems feasible at first glance, the issues discussed in this section are very difficult to address in practice.

In the project of Dervos et al. [49], authors are expected to remember their passwords and UAIs. Researchers do not bother to remember such lengthy codes. Further, all the co-authors must also be registered with the universal bibliographic database. A large number of authors may produce only 2 or 3 papers in their whole life. Such casual researchers take little interest in registering in the database. Moreover, not only casual researchers but also regular researchers (who produce a reasonable number of research papers) may provide wrong metadata information to the system. It is sometimes difficult to convince researchers to welcome new technologies. They may resist giving up previous practices and adopting new ones.


If such a database is developed, ideally it should accommodate all the research areas, languages, states, and all types

of publications. Such a database seems not to be economical as it demands not only one-time expenses (developing

cost) but also huge running expenses including staff salaries, maintenance, and security of the database, and handling

the user queries.

E. MATHEMATICAL NOTATIONS

Table 1 provides the mathematical notations used in this paper.

TABLE 1

MATHEMATICAL NOTATIONS

Symbols | Sets | Description
A | A = {a1, a2, …, ak}, where ai is the ith author and k is the no. of unique authors sharing an ambiguous name | Set of authors/persons sharing an ambiguous name
D | D = {d1, d2, …} | Set of documents in a dataset
P | P = {p1, …} | Set of publications/documents associated with an ambiguous author/name
… | … | No. of clusters = No. of unique authors associated with an ambiguous name
V | V = {v1, v2, …, vv}, where v is the number of vertices | Set of vertices in a graph
E | E = {e1, e2, …, ee}, where e is the number of edges | Set of edges in a graph
… | … | Number of unique authors
… | … | Set of words
… | … | Term, can be a word or set of words

III. ANDBD PROCESS

In this section, we describe the general process of ANDBD. We do not follow the process used by any particular research work; instead, we provide the common steps involved in the ANDBD process. The purpose of this section is to help readers comprehend the ANDBD task more easily and clearly. Figure 1 is the block diagram of the ANDBD process.

Figure 1. ANDBD process

A. PREPARING THE DATASET

For AND, a BD is used. The whole database is normally too large to analyze within a limited time. To avoid excessive query-processing time on real-life databases, a small dataset is either selected from a functional BD or prepared from scratch, normally by crawling the web pages of ambiguous authors. For example, Han et al. [51] exploit two datasets, one for 15 different “J. Anderson”s and the other for 11 unique “J. Smith”s, while Wang et al. [52] used a dataset containing 16 ambiguous names comprising 241 unique authors. Preprocessing in name disambiguation usually includes blocking, stop-word removal, and stemming [53]. Stop-word removal and stemming are required for the title words of publications and venues. A blocking step is performed to group together the authors with ambiguous names. Disambiguation operations are then performed within each ambiguous group to avoid useless comparisons and operations involving non-ambiguous authors.
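The preprocessing steps above can be illustrated with a minimal sketch. It assumes a blocking key made of the first initial plus the last name, and a tiny hand-written stop-word list; both are illustrative assumptions rather than the choices of any surveyed work, and stemming is omitted for brevity. The names `blocking_key`, `title_tokens`, and `build_blocks` are hypothetical.

```python
from collections import defaultdict

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and"}


def blocking_key(author_name: str) -> str:
    """Map a name to 'first initial + last name' (an assumed blocking scheme)."""
    parts = author_name.replace(".", " ").lower().split()
    return f"{parts[0][0]} {parts[-1]}"


def title_tokens(title: str) -> list[str]:
    """Lowercase a title and drop stop words (stemming is not performed here)."""
    return [word for word in title.lower().split() if word not in STOP_WORDS]


def build_blocks(records):
    """Group (author_name, title) records into blocks of ambiguous names."""
    blocks = defaultdict(list)
    for name, title in records:
        blocks[blocking_key(name)].append((name, title_tokens(title)))
    return dict(blocks)
```

With such blocks, pairwise comparisons only need to run inside each block (for example, all records keyed `j anderson`), which is the saving the blocking step is meant to provide.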

B. SELECTING THE PUBLICATION ATTRIBUTES

It is always desirable to utilize as many attributes of the publications as are available, though only useful ones are considered. Not all BD provide the same number and type of attributes, but three common attributes (co-authors, publication title, and venue) are available in almost all of them. We call these three attributes the triplet attributes. Most studies, like [51], use only the triplet attributes, while [40] exploits the triplet attributes plus topic similarity. Some methods like [52], [54] take advantage of indirect co-authors, feedback, co-web, and publication year along with the triplet attributes. Torvik et al. [55] propose eight different attributes: (1) middle initial, (2) suffix (e.g., Prof. or II), (3) full name, (4) language, (5) number of common co-authors, (6) number of common title words, (7) number of common affiliation words and (8) number of common Medical Subject Headings (MeSH) words. As more attributes are added, accuracy usually increases slightly at the cost of time complexity. In AND, time complexity is not a major concern; however, the unavailability of a reasonable number of distinguishing attributes is a bottleneck.

[Figure 1 block labels: Preparing the dataset (papers list of ambiguous name); Selecting paper attributes; Similarity (between papers) estimation modules; Methods; Clusters (no. of clusters equal to no. of unique authors); Performance evaluation]

C. SELECTING THE SIMILARITY ESTIMATORS

After the selection of available attributes, the most technical task is to select a proper similarity estimator for each attribute. Almost all AND methods work on the notion that the higher the similarity between the attributes of two citations, the more plausible it is that they belong to the same author. The focus of the proposed similarity estimators is always to estimate the optimum similarity value between the attributes of the two papers. Researchers exploit various similarity estimators for each type of attribute. For example, Shu et al. [40] used the edit distance of two strings for the co-author attribute, the cosine similarity measure for the title and venue attributes, and the Latent Dirichlet Allocation (LDA) [56] topic model for semantic topic similarity.

D. SELECTING THE MODELS

In this study, we categorize the AND methods into five types: (1) supervised learning, (2) unsupervised learning, (3) semi-supervised learning, (4) graph-based, and (5) ontology-based. Supervised learning models perform classification, unsupervised learning methods perform clustering, and semi-supervised models combine supervised and unsupervised methods. Graph-based methods exploit links, and ontology-based methods exploit semantics-based relationships between entities. The purpose of all methods is to separate the publications of each unique author into a unique class/cluster. A large number of methods are available, so one must first decide which type of method to employ, keeping in mind the pros and cons of each alternative. One may also devise a new method. SVM, the C4.5 decision tree algorithm, and random forests are widely used classification models in AND. On the other hand, spectral clustering and DBSCAN are popular clustering models.

E. MEASURING THE PERFORMANCE

The performance of the method used is measured using different performance metrics. Precision, recall, and F-measure are very common performance metrics used for the evaluation of AND methods.
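As a hedged illustration of how these metrics can be computed for a clustering result, the sketch below uses the pairwise formulation: a pair of publications counts as a true positive when it is placed in the same cluster and truly belongs to the same author. The pairwise variant is an assumption made here for concreteness (other variants, such as cluster-level measures, also exist), and the function name `pairwise_prf` is hypothetical.

```python
from itertools import combinations


def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 for a clustering of publications.

    true_labels[i] is the real author of publication i;
    pred_labels[i] is the cluster assigned to publication i.
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    same_true = {(i, j) for i, j in pairs if true_labels[i] == true_labels[j]}
    same_pred = {(i, j) for i, j in pairs if pred_labels[i] == pred_labels[j]}
    true_positives = len(same_true & same_pred)

    precision = true_positives / len(same_pred) if same_pred else 0.0
    recall = true_positives / len(same_true) if same_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because every pair of publications is enumerated, the cost is quadratic in the block size, which is usually acceptable since evaluation is run per ambiguous name.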

IV. DATASETS

Well-known BD like DBLP, MEDLINE, DBComp, Scopus, and CiteSeer have been widely utilized by researchers for AND. DBLP is the most widely used database for this purpose. The reason, perhaps, is that the publication records in DBLP are represented in a well-structured format, i.e., XML. The basic issue faced by researchers is how to measure the performance of a proposed method against standard, huge databases. For this purpose, they pick a few ambiguous names from the database along with their publications and other discriminative attributes and investigate the performance of their proposed method on them.

For example, Han et al. [51] exploited two types of datasets: (1) one collected manually from the web by querying Google, and (2) one of selected ambiguous names from DBLP. The first dataset consists of two ambiguous names, “J. Anderson” and “J. Smith”. The “J. Anderson” part of the dataset consists of 15 unique authors who share the same name and have 229 publications in total; “J. Smith” is shared by 11 different authors whose total publications are 338. The “J. Anderson” part of the first dataset is shown in Table 2. Tables 2, 3, and 4 show some examples of name ambiguity. We can see from Table 2 that there are 15 different people whose first name is James and whose last name is Anderson. However, many of them have different middle initials. All these names can appear in a publication as J. Anderson, and it must be resolved which J. Anderson is actually intended. The second dataset consists of 9 ambiguous names, each having more than 10 name variations, as shown in Table 3. These datasets were later used by many other works like [34], [57].

Ferreira et al. [58] also used two datasets. They collected records from DBLP and DBComp. The statistics are given in Table 4. Many other studies like [34], [57], [59], [60] have used these datasets with some variations. Reuther [61] investigated the existing test collections and proposed three new test collections to resolve the name variant problem.


TABLE 2

“J. ANDERSON” PART OF FIRST DATASET USED BY HAN ET AL. [51]

Full Name | Affiliation | No. of Pubs
James Nicholas Anderson | UK Edinburgh | …
James D. Anderson | Univ. of Toronto | …
James E. Anderson | Boston College | …
James P. Anderson | N/A | …
James A. Anderson | Brown University | …
James M. Anderson | N/A | …
James B. Anderson | Penn. State Univ | …
James Anderson | … | …
James B. Anderson | Univ. of Toronto | …
James W. Anderson | Univ. of KY | …
James B. Anderson | Univ. of Florida | …
Jim Anderson | Univ. of Southampton | …
James H. Anderson | Univ. of North Carolina | …
Jim V. Anderson | Virginia Tech Univ. | …
James H. Anderson | Stanford Univ. | …

TABLE 3

SECOND DATASET USED BY HAN ET AL. [51]

Ambiguous Names | Name Variations | No. of Pubs
S Lee | … | 467
C Lee | … | 152
J Lee | … | 330
A Gupta | … | 332
J Kim | … | 239
J Chen | … | 174
Y Chen | … | 201
H Kim | … | 120
S Kim | … | 181

TABLE 4

DATASETS USED BY FERREIRA ET AL. [58]

DBLP: Ambiguous Names | No. of Authors | No. of Pubs | DBComp: Ambiguous Names | No. of Authors | No. of Pubs
A. Gupta | … | 576 | A. Oliveira | … | …
A. Kumar | … | 243 | A. Silva | … | …
C. Chen | … | 798 | F. Silva | … | …
D. Johnson | … | 368 | J. Oliveira | … | …
J. Martin | … | 112 | J. Silva | … | …
J. Robinson | … | 171 | J. Souza | … | …
J. Smith | … | 921 | L. Silva | … | …
K. Tanaka | … | 280 | Silva | … | …
M. Brown | … | 153 | R. Santos | … | …
M. Jones | … | 260 | R. Silva | … | …
M. Miller | … | 405 | | |

V. SIMILARITY METRICS

Selecting an appropriate similarity metric/distance function is a technical and challenging task [62] in AND. It is

advisable to employ the best fit similarity measure for each attribute of the publications. No single metric is the best

fit for all the attributes. Cohen et al. [63] compared different similarity metrics for name matching and concluded that

a combination of metrics provides better results than any single metric. Most of the similarity measures do not make

use of the semantics of the publications and use syntactic characteristics only, so we categorize these metrics into two

types (1) syntactic and (2) semantic similarity metrics.

A. SYNTACTIC SIMILARITY METRICS

Similarity metrics that match strings exactly and do not account for synonymy and polysemy are syntactic similarity metrics. The similarity of two publications can be obtained by cosine, Euclidean, Manhattan, Jaccard, Jaro, Winkler, and TFIDF metrics. These metrics often outperform Levenshtein-distance-based techniques [63]. Besides these metrics, many other measures like typewriter distance, Jaro-Winkler, Monge-Elkan, or phonetic distances can also be employed. The most used subcategories of syntactic similarity metrics are (1) edit distance and (2) token-based distance metrics.

1) EDIT DISTANCE METRICS


Distance functions map two strings S1 and S2 to a real number r, where a larger value of r indicates greater distance or

smaller similarity between S1 and S2. String distances are most useful for matching problems with little prior

knowledge and/or ill-structured data [63]. A variety of edit distance functions are used in text mining tasks. The edit

distance of two strings (names) is the minimum number of operations required to transform one string to the other.

These operations include insertion, deletion, and replacement of a character. A good comparison of name matching

techniques is given in [63].

The simplest is the Levenshtein distance [63], which assigns a unit cost to all edit operations. The Monge-Elkan distance function [64] is more complex, is tuned with particular cost parameters, and is scaled to the interval (0, 1). It is a variant of the Smith-Waterman distance function [65] and assigns a relatively lower cost to a sequence of insertions or deletions.
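The unit-cost edit distance described above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not the implementation of any surveyed work; the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of unit-cost insertions, deletions, and replacements
    needed to transform s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, char1 in enumerate(s1, start=1):
        current = [i]  # transforming s1[:i] into the empty string costs i deletions
        for j, char2 in enumerate(s2, start=1):
            replace_cost = 0 if char1 == char2 else 1
            current.append(
                min(
                    previous[j] + 1,                 # delete char1
                    current[j - 1] + 1,              # insert char2
                    previous[j - 1] + replace_cost,  # replace (or keep) the character
                )
            )
        previous = current
    return previous[-1]
```

Only two rows of the table are kept at any time, so memory grows with the length of the second string rather than with the product of both lengths.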

Shu et al. [40], Bhattacharya and Getoor [39], Torvik et al. [55], and Smalheiser and Torvik [32] utilized edit distance-

like measures for measuring name distance of the co-authors of two citations. Shu et al. [40] applied rule-based

methodology along with edit distance.

A broadly similar metric that is not based on the edit distance model is the Jaro metric [66], which is based on the number and order of the common characters between the two strings [37], [42], [53]. A variant of this function is Jaro-Winkler [67], which also exploits the length of the longest common prefix of S1 and S2 [37], [42], [53], [68].
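A minimal sketch of the Jaro and Jaro-Winkler similarities follows. It uses the commonly cited formulation: characters match if they are equal and lie within a window of half the longer string's length minus one, and transpositions are counted as half the number of matched characters that appear in a different order. The prefix scaling factor of 0.1 and the prefix cap of 4 characters are conventional defaults assumed here, not values stated in this survey.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)
```

The prefix boost reflects the observation that name variants tend to agree at the beginning, so two spellings that share their first characters are scored as more similar than the plain Jaro value suggests.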

2) TOKEN BASED DISTANCE METRICS

Token-based distance metrics compare the words of the two strings S1 and S2 rather than the characters. Euclidean distance is commonly used for text clustering problems and similarity estimation [28], [36], [54], [57], [69]. Let d1 and d2 represent the vectors of two documents; then the Euclidean distance between the two documents can be calculated as:

$$DIST_E(d_1, d_2) = \left( \sum_{t=1}^{n} \left| w_{t,1} - w_{t,2} \right|^2 \right)^{1/2} \tag{1}$$

where T = {t1, …, tn} is the term set and w_{t,1} and w_{t,2} denote the weights (e.g., term frequencies) of term t in d1 and d2, respectively.

In Term Frequency-Inverse Document Frequency (TFIDF) weighting, TF is the frequency of word w in an attribute of a publication, and IDF is the inverse of the fraction of records in the dataset that contain w; it is used by [34], [37], [42], [53], [70], [71]. Cohen et al. [63] considered a soft version of TFIDF in which similar tokens are also considered along with the tokens in S1 ∩ S2. Most research works like [37], [38], [40], [51]–[54], [58] use the cosine similarity, which exploits TFIDF and the vector space model (VSM) [72]. Normally this function is used for the title and venue attributes, although it can be used for any attribute represented in the form of vectors. The documents are represented in vector space. Let d1 and d2 represent the vectors of two documents; then the cosine similarity between the two documents can be calculated as:

𝑆𝐼𝑀(d,d)= 𝐶𝑜𝑠𝑖𝑛𝑒 𝛳 = .

||.||………… ………(2)

Jaccard coefficient, also called the Tanimoto coefficient, is the ratio between the intersection and the union of the

objects. It compares the sum weight of common terms to the sum weight of terms that are present in either of the two

documents except for the common terms [36], [37], [42], [53], [71]. Let d1 and d2 represent vectors of two documents.

The Jaccard coefficient between the two documents is:

𝑆𝐼𝑀(d,d)=.

||||.…………… …… (3)

A document can also be considered as a probability distribution over terms. The similarity between two documents can then be calculated by measuring the distance between the two corresponding probability distributions. Let d1 and d2 represent the vectors of two documents; the KL divergence between the two distributions of words is calculated as:

$$D_{KL}(d_1 \,\|\, d_2) = \sum_{t=1}^{n} w_{t,1} \times \log\frac{w_{t,1}}{w_{t,2}} \tag{4}$$

The KL divergence is not symmetric, whereas the average KL divergence is symmetric, which is why the average KL divergence is more popular: the average weighted KL divergence from di to dj is the same as that from dj to di. This average weighting between the vectors of the two corresponding documents guarantees symmetry. For text documents, the average KL divergence between the two distributions of words is calculated as:

$$D_{AvgKL}(d_1 \,\|\, d_2) = \sum_{t=1}^{n} \left( \pi_1 \times D(w_{t,1} \,\|\, w_t) + \pi_2 \times D(w_{t,2} \,\|\, w_t) \right) \tag{5}$$

where $\pi_1 = \frac{w_{t,1}}{w_{t,1} + w_{t,2}}$, $\pi_2 = \frac{w_{t,2}}{w_{t,1} + w_{t,2}}$, and $w_t = \pi_1 \times w_{t,1} + \pi_2 \times w_{t,2}$.
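The asymmetry of the KL divergence and the symmetry of its averaged form can be checked numerically with a short sketch. It assumes that documents are given as aligned lists of term probabilities, and that the per-term divergence D(w_{t,i} || w_t) takes the form w_{t,i} × log(w_{t,i} / w_t), read off from the KL definition in Eq. (4); both are assumptions made here for illustration.

```python
import math


def kl_divergence(p, q):
    """KL divergence of p from q; q must be positive wherever p is positive."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)


def average_kl(p, q):
    """Average (weighted) KL divergence, symmetric in its arguments."""
    total = 0.0
    for w1, w2 in zip(p, q):
        if w1 + w2 == 0:
            continue  # the term is absent from both documents
        pi1 = w1 / (w1 + w2)
        pi2 = w2 / (w1 + w2)
        w_avg = pi1 * w1 + pi2 * w2
        if w1 > 0:
            total += pi1 * w1 * math.log(w1 / w_avg)
        if w2 > 0:
            total += pi2 * w2 * math.log(w2 / w_avg)
    return total
```

Swapping the two arguments of `average_kl` only swaps the roles of the two weights inside each term, so the sum is unchanged, whereas `kl_divergence` generally returns different values in the two directions.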

B. SEMANTIC SIMILARITY METRICS


The measures discussed above help in estimating pair-wise similarities between the corresponding attributes of the publications. They usually exploit syntactic characteristics and are unable to utilize the synonymy- and polysemy-based semantics of publications. Topic models such as PLSA [73] and LDA [56] provide excellent ways to exploit semantics. A publication mostly contains multiple topics, and it is important to find the topic similarity between two publications. Generally, a topic is a semantically related probabilistic cluster of terms (words). Here, we describe LDA, which can capture semantics in an unsupervised way. It is a generative probabilistic model for text corpora [48], [56], [74] at the level of words and documents. It treats every document as a mixture of topics and every topic as a multinomial distribution over the words in the vocabulary, with Dirichlet priors on both. It has been used for finding topic similarity among publications [28], [39], [51]. Shu et al. [40] and Song et al. [28] extend the LDA model and apply it to AND. The probability of generating word w from document d is given as:

$$P(w \mid d, \theta, \Phi) = \sum_{z} P(w \mid z, \Phi_z)\, P(z \mid d, \theta_d) \tag{6}$$

where w is a word of document d, z is a topic, and $\theta_d$ and $\Phi_z$ are multinomial distributions over topics (specific to d) and over words (specific to z), respectively.
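As a small worked illustration of Eq. (6), the sketch below computes the word probability as a topic-weighted mixture. The function name and the toy values of the document-topic and topic-word distributions are hypothetical; in practice they would be estimated by an LDA inference procedure.

```python
def word_probability(word, theta_d, phi):
    """P(w | d) as the sum over topics z of P(w | z) * P(z | d), as in Eq. (6)."""
    return sum(theta_d[z] * phi[z].get(word, 0.0) for z in range(len(theta_d)))

# Hypothetical document-topic distribution for one document (two topics).
theta_d = [0.7, 0.3]

# Hypothetical topic-word distributions over a three-word vocabulary.
phi = [
    {"clustering": 0.5, "author": 0.4, "protein": 0.1},  # topic 0
    {"clustering": 0.1, "author": 0.1, "protein": 0.8},  # topic 1
]

for w in ("clustering", "author", "protein"):
    print(w, word_probability(w, theta_d, phi))
```

Because each topic-word distribution sums to one and the topic proportions sum to one, the resulting word probabilities also sum to one over the vocabulary.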

VI. APPROACHES FOR ANDBD

Much research work has been done on entity resolution in a variety of research areas. In the field of databases, studies

are made on merge/purge [75], record linkage [76], duplicate record detection [77], data association [78] and database

hardening [79]. In Natural Language Processing (NLP), Cross-Document Co-Reference [80] methodologies and name

matching algorithms [44] are designed. In BD, several methods or models are employed, such as, citation matching

[81], k-way spectral clustering [34], social network similarity [35], mixed and split citation [37], Latent Topic Model

[40], latent Dirichlet model [39], Random Forests [42], Graph-based GHOST [43] and Ontology-based Category

Utility [82].

A variety of solutions [32], [72], ranging from manual assignment by librarians [34], [83] to unsupervised learning, have been provided for AND. Most researchers categorize ANDBD methods into supervised, unsupervised, and semi-supervised learning methods. Graph-based and ontology-based methods have also been applied to resolve AND. We classify the methods for AND into the following five categories. Each category is explained in chronological order, with a discussion of its pros and cons.

A. SUPERVISED LEARNING METHODS

In supervised learning [42], [51], [55], [57], [84]–[86], the major objective is to find class labels by exploiting the related information. Supervised learning is labor-intensive and costly, and it is error-prone if the labeling or training of the dataset is not performed properly. Supervised methods achieve better performance than unsupervised methods, at the cost of the labor and time spent on labeling. Supervised methods may be exploited to predict an author's name in a citation [51] or to disambiguate the publications of a particular author [42], [55], [84], [85].

Han et al. [51] proposed two supervised methods to disambiguate author names in publications, using VSM [72], [87] for the representation of publications and cosine similarity for calculating the pair-wise similarity of publication attributes. They form canonical names by grouping together author names with the same first-name initial and the same last name. Each canonical name is associated with all the publications in which that name appears. The first method applies the naive Bayes probability model [88] and the second applies Support Vector Machines (SVMs) [89]. Both methods exploit the triplet¹ attributes for similarity calculations. This well-known work extends Han et al. [90], which combined k-means clustering with the naive Bayes model on the same dataset and attribute set.
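The vector space representation and cosine similarity over the triplet attributes can be sketched as follows. This is only an illustration under simplifying assumptions: raw term frequencies are used as weights (a TF-IDF weighting could be substituted), each co-author name is treated as a single term, and the function and field names are hypothetical rather than taken from [51].

```python
import math
from collections import Counter

def to_vector(tokens):
    """Bag-of-words vector: term -> raw frequency."""
    return Counter(tokens)

def cosine(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v2[term] for term, weight in v1.items() if term in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def triplet_similarity(pub1, pub2):
    """One cosine score per triplet attribute (co-authors, title, venue)."""
    return {
        attr: cosine(to_vector(pub1[attr]), to_vector(pub2[attr]))
        for attr in ("coauthors", "title", "venue")
    }

# Hypothetical publications sharing an ambiguous author name.
pub_a = {
    "coauthors": ["a. gupta", "b. lee"],
    "title": "name disambiguation in citations".split(),
    "venue": ["jcdl"],
}
pub_b = {
    "coauthors": ["a. gupta"],
    "title": "author name disambiguation".split(),
    "venue": ["jcdl"],
}
print(triplet_similarity(pub_a, pub_b))
```

The resulting per-attribute scores form a feature vector that a classifier such as naive Bayes or an SVM could take as input.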

Torvik et al. [55] proposed an authority control framework to resolve only the name-sharing problem for MEDLINE records by using eight different attributes. They calculated a pair-wise similarity profile based on these attributes and decided whether a pair of publications containing the same author name belongs to a single individual. Culotta et al. [84] proposed a method that overcomes the transitivity problem produced by pair-wise comparisons. A researcher can have multiple papers, email addresses, and affiliations. When comparing the publications of such authors, a pair-wise classifier cannot handle multiple instances of an attribute. They employed sets rather than pair-wise comparisons and addressed the transitivity issue between co-authors in a better way. A new publication is compared with all the publications in a cluster rather than with one publication at a time. Comparing a publication with sets makes it possible to handle multiple values of an attribute.

1 In this article we refer to the co-authors, title, and venue attributes as triplet attributes.

Yin et al. [36] focused on the name-sharing problem, considering only identical names. They proposed DISTINCT, an object distinction methodology to disambiguate authors. It combines the set resemblance of neighbor tuples and the random walk probability between two records of a relational database. SVM [89] is applied to assign weights to the various types of links in the graph, and agglomerative hierarchical clustering is used to obtain the final clusters.

Torvik and Smalheiser [85] enhance their work [55] by (a) including first name and its variants, emails, and

correlations between last names and affiliation words; (b) employing new procedures of constructing huge training

sets; (c) exploiting methods for calculating the prior probability; (d) correcting transitivity violations by a weighted

least squares algorithm; and (e) using an agglomerative algorithm based on maximum likelihood for calculating

clusters of articles that represent authors. The work proposed in [55] was not scalable, which is a common problem of AND methods. These enhancements make it scalable to a huge dataset such as the MEDLINE records.

Pucktada and Giles [42] resolve the name-sharing problem in MEDLINE records. They introduce a Random Forest classifier to find a high-quality pair-wise linkage function. They define a similarity profile over 21 attributes grouped into six types: three of them are the triplet attributes and the other three are affiliation similarity, concept similarity, and author similarity. They use a naive blocking procedure based on the author's last name and first initial: records whose author names do not share both parts are placed in different blocks and are not compared. They compare the results with SVM and show that Random Forests outperform SVM.
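The blocking step described above can be sketched as follows. This is a minimal illustration, assuming names are written in "First [Middle] Last" order; the function names and the sample records are hypothetical and the pair-wise classifier itself is not shown.

```python
from collections import defaultdict
from itertools import combinations

def block_key(author_name):
    """Blocking key: (last name, first initial), lowercased."""
    parts = author_name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())

def candidate_pairs(records):
    """Return pairs of record ids that fall into the same block.

    records: iterable of (record_id, author_name) tuples.
    Only these pairs would later be passed to the pair-wise classifier.
    """
    blocks = defaultdict(list)
    for record_id, name in records:
        blocks[block_key(name)].append(record_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Hypothetical author-name occurrences.
records = [(1, "Wei Wang"), (2, "W. Wang"), (3, "Jing Wang"), (4, "Wei Zhang")]
print(candidate_pairs(records))
```

Blocking reduces the number of comparisons from all pairs of records to the pairs inside each block, at the risk of missing matches whose names differ in the blocking key.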

Qian et al. [86] proposed Labeling Oriented Author Disambiguation (LOAD) to resolve the author name disambiguation problem. LOAD uses supervised training to estimate the similarity between publications and builds High Precision Clusters (HPCs) for each author, which changes the labeling granularity from individual publications to clusters. Labeling HPCs reduces the labeling effort by at least a factor of 10 compared with labeling individual publications. The HPCs are then clustered into High Recall Clusters (HRCs) to place all publications of one author into the same cluster. For pair-wise comparisons, LOAD employs rich features such as name, email, affiliation, and homepage between two authors, co-author name, co-author email, co-author affiliation, co-author homepage, title bigram, reference, and download link. In addition, self-citation and the publishing-year interval between two papers are also considered.

The methods discussed above perform name disambiguation in an offline environment. In contrast, Sun et al. [91] proposed a publication analysis system that decides at query time, with the involvement of the user, whether the queried author name matches the given set of publications retrieved from the Google Scholar database. The system exploits two kinds of heuristic features: (1) the number of publications per name variation, and (2) publication topic consistency. Topic consistency exploits discipline tags crowd-sourced from the users of the Scholarometer system [92]. They train a binary classifier on a dataset of the 500 top-ranked authors from the Scholarometer database¹, each manually labeled as either ambiguous or unambiguous, and examine the publications retrieved from Google Scholar for each queried name. To the best of our knowledge, this is the first work addressing real-time author name disambiguation; it achieves 75% accuracy.

Zhang et al. [93] proposed a Bayesian non-exhaustive classification method for resolving online name disambiguation

problems. They considered a case study for bibliographic data and involved a temporal stream format for

disambiguating authors by dividing their papers into similar groups. Table 5 provides a quick summary of the methods

based on supervised learning models.

TABLE 5

SUMMARY OF SUPERVISED LEARNING METHODS

Reference | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation
Han et al. [51] 2004 | Disambiguate names in citations | Naive Bayes probability model, SVM | Co-author names, paper title, venue | Comparison of both approaches and their hybrid approach | Publications from web, DBLP | Hybrid of naive Bayes outperforms Hybrid I scheme of SVM | Not flexible, not topic sensitive
Torvik et al. [55] 2005 | Resolve name sharing | Authority control framework | 8 different attributes | Comparison is performed with manually labeled data only | Medline | Different articles authored by the same individual will share similarity in one or more aspects of Medline records | No comparison with state-of-the-art, specific to Medline records only
Culotta et al. [84] 2007 | Transitivity due to pair-wise comparisons | Supervised machine learning, error-driven, rank-based training | Examining sets of records, not pairs | Approach is evaluated on three different datasets | Penn, Rexa, DBLP | Error reduction of 60% over standard binary classification approach | Not topic sensitive, not compared with state-of-the-art
Yin et al. [36] 2007 | Name sharing problem | Supervised and unsupervised set resemblance and random walk | Fusion of different types of subtle linkages | Comparison of both approaches and their hybrid approach | DBLP | Fusing different types of linkages and combining set resemblance of neighbor tuples and random walk probability is effective | Not compared with state-of-the-art, specific to authors with identical names only
Torvik and Smalheiser [85] 2009 | Enhancement of [23] | Estimating the probability that two articles sharing the same name were written by the same individual | Adding 5 more variants to [23] | [23] | Medline | Author-ity model with more scalability and recall | Not high performance, model will fail to apply to scientists whose research output is diverse
Pucktada and Giles [42] 2009 | Name sharing problem | Random Forest classifier, naive blocking | 21 different attributes | SVM | Medline | Random Forest classifier outperforms SVM | High accuracy can be achieved with a relatively small set of features
Qian et al. [86] 2011 | Labeling Oriented Author Disambiguation | Estimating similarity between publications using High Precision Clusters | Set of rich features | Human labeling after conventional automatic author disambiguation | CS, UE and DBLP | Machine learning combined with human judgement produces more accurate results to assist and reduce human labeling | No iterative process for AND, limited usage of feature sources, no usage of direct optimization algorithms
Sun et al. [91] 2011 | Detect ambiguous names at query time | Finding ambiguities from crowdsourced annotations | Number of citations per name variation, publication topic consistency | For each combination of features, accuracy, area under curve and | Papers retrieved from Google Scholar | Improved accuracy | Publication metadata was not considered
Zhang et al. [93] 2016 | Online name entity disambiguation | Dirichlet process prior with a Normal × Normal × Inverse Wishart data model | Temporal stream format | Qian's method [63], Khabsa's method [64] | AMiner | Proposed method outperforms the state-of-the-art methods | Computational complexity depends upon several factors and can be variable

1 scholarometer.indiana.edu

B. UNSUPERVISED LEARNING METHODS

Unsupervised learning methods [28], [34], [35], [39], [59], [60], [70], [94]–[99] do not need manual labeling. Instead, they rely on carefully chosen features to group similar entities into clusters, and various clustering algorithms are applied for this purpose. For example, Giles et al. [34] apply a k-way spectral clustering method to resolve AND. Unsupervised learning methods save labeling time at the cost of efficiency and precision. However, in many dynamic scenarios, unsupervised learning methods are a better solution than supervised learning methods.

Unsupervised methods may utilize similarities between publications, with the help of a predefined set of similarity functions, to group the publications of a particular author. These functions are usually defined over the features present in the publications [34], [35], [59], [94]–[97]. These features are also called local information [40], as they are explicitly available in the publication. The similarity functions may also be defined over implicit information such as the topics of the publication [36], [40], [60] or Web data [60], [98], [99]. The information about the topic(s) of a publication is not explicitly present in the publication under consideration; rather, it is derived from the dataset and is hence called global information [40].

Giles et al. [34] improved their previous work [51] by applying k-way spectral clustering to AND, using the triplet attributes for measuring similarity. Malin [35] applied hierarchical clustering and random walks to resolve the name sharing and name variant problems. The main limitation of this method is the static threshold used as the stopping criterion of the clustering process. Bekkerman and McCallum [70] address the name ambiguity problem with two frameworks: the first uses the link structure of Web pages, and the second exploits A/CDC (Agglomerative/Conglomerative Double Clustering). Their methods require only a minimum of prior knowledge, as provided in BD. However, their methods fit Web appearances better than BD.


Bhattacharya and Getoor [39] treated AND as an entity resolution problem and extended the LDA topic model [56]. They assume that authors belong to one or more groups of authors who may co-author papers, and they simultaneously discover the clusters of authors and the clusters of papers written by these authors. They perform parameter estimation through the Expectation-Maximization (EM) algorithm along with Gibbs sampling [100]. The extended model is about 100 times slower than an alternative method [95] and solves only the name variant problem. Bhattacharya and Getoor [95] proposed a collective entity resolution method as an improvement on their previous work [39]. Given two papers both written by authors a1 and a2, if the two instances of a2 refer to the same individual, then it is likely that both instances of a1 also refer to the same entity. Resolving this second-level ambiguity helps in cases where there is a high level of ambiguity. They treat high- and low-ambiguity scenarios separately, first addressing the most confident assignments and then the less confident ones. The final similarity value between two citations is calculated from pair-wise comparisons and previously disambiguated authors. The weighting parameter is adjusted manually, and its optimal value may differ across contexts. Although this method is an advancement over their previous work [39], scalability remained a problem.

Cota et al. [96] proposed a heuristic-based hierarchical clustering method that successively combines clusters of citation records of the ambiguous authors. In the first step, the compatibility of the ambiguous author names is checked. If the names in two publications are compatible, the publications are further compared for common compatible co-author(s). The two publications are merged into one cluster if a compatible co-author is found; otherwise they form separate clusters. The resulting clusters are almost pure but fragmented. To decrease the fragmentation, a second step compares clusters in a pair-wise fashion using the title and venue attributes. The major distinction of this method is that it compares all the titles and venues of a cluster with those of other clusters, applying a bag-of-words approach. If the similarity between two clusters reaches a threshold value, they are fused into one cluster; otherwise they remain separate. They claim improvements of up to 12% over non-hierarchical clustering, 21% over SVM, and 15.5% over K-means using the same attributes.
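The two-step procedure described above can be sketched roughly as follows. This is a simplified illustration, not the algorithm of [96]: all records are assumed to already carry compatible ambiguous names, co-author compatibility is reduced to exact string equality, and the cluster similarity is a Jaccard overlap of title and venue words chosen here only for brevity. All function and field names are hypothetical.

```python
def merge_by_coauthor(pubs):
    """Step 1: merge publications that share at least one co-author."""
    clusters = []
    for pub in pubs:
        merged, untouched = [pub], []
        for cluster in clusters:
            if any(pub["coauthors"] & other["coauthors"] for other in cluster):
                merged.extend(cluster)
            else:
                untouched.append(cluster)
        clusters = untouched + [merged]
    return clusters

def bag_of_words(cluster):
    """All title and venue words of a cluster, as a set."""
    words = set()
    for pub in cluster:
        words.update(pub["title"].lower().split())
        words.update(pub["venue"].lower().split())
    return words

def merge_by_content(clusters, threshold):
    """Step 2: fuse clusters whose bag-of-words similarity reaches the threshold."""
    clusters = [list(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = bag_of_words(clusters[i]), bag_of_words(clusters[j])
                union = a | b
                if union and len(a & b) / len(union) >= threshold:
                    clusters[i].extend(clusters.pop(j))
                    changed = True
                    break
            if changed:
                break
    return clusters

# Hypothetical citation records for one ambiguous author name.
pubs = [
    {"id": 1, "coauthors": {"a gupta"}, "title": "name disambiguation in digital libraries", "venue": "jcdl"},
    {"id": 2, "coauthors": {"a gupta", "b lee"}, "title": "entity resolution for citations", "venue": "jcdl"},
    {"id": 3, "coauthors": {"c kim"}, "title": "name disambiguation in citations", "venue": "jcdl"},
    {"id": 4, "coauthors": {"d roy"}, "title": "protein folding simulation", "venue": "bioinformatics"},
]
pure = merge_by_coauthor(pubs)
final = merge_by_content(pure, threshold=0.4)
print([[p["id"] for p in c] for c in final])
```

In this toy input the first step leaves record 3 in its own fragment because it shares no co-author with the others, and the second step fuses it with the first cluster through its title and venue words.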

Song et al. [28] proposed an algorithm based on Probabilistic Latent Semantic Analysis [73] and Latent Dirichlet

Allocation [56] to deal with AND exploiting the contents of the articles. They exploited metadata of publications and

authors and publication’s first page to relate authors to topics.

Shin et al. [101] proposed an AND framework that constructs a social network to find semantic relationships between authors and solves the name sharing and name variant problems simultaneously. They employ two methods: one for namesake names and the other for heteronymous names. A social network is constructed in three steps. (1) Information extraction: extraction of the paper title. (2) Candidate topic extraction: extraction of topics that are representative of the publication. These candidate topics are extracted from the abstract of the publication using morphemic analysis [102]. (3) Social network construction: the social network is constructed from the above two types of information. They used the cosine similarity metric to find the similarity between two social networks.

Yang and Wu [103] resolve the name sharing problem by exploiting the triplet attributes along with a web attribute. They use cosine similarity and a Modified Sigmoid Function (MSF) for the triplet attributes, and Maximum Normalized Document Frequency (MNDF) for the web attribute, to estimate the pair-wise similarity between publications. They also employ a binary classifier to reduce the noise in the clustered publications.

Tang et al. [29] formalize the name disambiguation problems in a unified probabilistic framework. The framework uses Markov Random Fields (MRF) [104], exploiting six local (publication) attributes (content-based information) and five relationships (structure-based information) between pairs of publications. The framework achieves better accuracy than the baselines, but its time cost is almost twice that of the baselines.

Wu et al. [105] used Dempster-Shafer theory (DST) for AND. They proposed an unsupervised DST-based hierarchical agglomerative clustering algorithm, which is combined with Shannon's entropy to blend disambiguation attributes and obtain a more reliable candidate pair of clusters to merge in each iteration of clustering. Qian et al. [106] proposed a dynamic method for author name disambiguation, keeping the growing nature of digital libraries in mind. They proposed a two-step process, BatchAD+IncAD, which first performs AND by grouping all records into disjoint clusters and then periodically performs incremental AND for newly added papers, determining whether each new paper belongs to an existing cluster or forms a new one. Khabsa et al. [107] proposed a constraint-based clustering algorithm that allows both constraints and data to be added to the clustering process incrementally. This methodology helps users by allowing them to make corrections to the disambiguated results. The method is based on a combination of DBSCAN and a pair-wise distance based on random forests. Sun et al. [108] proposed an unsupervised AND solution based on topological features. To measure the similarity of publications, the method uses a structure similarity algorithm along with a random walk with restarts. Table 6 summarizes the unsupervised learning methods for AND.
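The incremental idea behind approaches such as BatchAD+IncAD, where a newly added paper either joins an existing cluster or starts a new one, can be illustrated with the sketch below. It is not the model of [106]: each paper is reduced to a set of feature tokens, the cluster profile is the union of its papers' tokens, and the Jaccard similarity and the threshold value are assumptions made only for this example.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of feature tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_incrementally(clusters, new_paper, threshold=0.2):
    """Add new_paper to the most similar existing cluster, or open a new one.

    clusters: list of clusters, each a list of papers (sets of feature tokens).
    Returns the index of the cluster that received the paper.
    """
    best_index, best_sim = None, 0.0
    for index, cluster in enumerate(clusters):
        profile = set().union(*cluster)  # all tokens seen in this cluster
        sim = jaccard(profile, new_paper)
        if sim > best_sim:
            best_index, best_sim = index, sim
    if best_index is not None and best_sim >= threshold:
        clusters[best_index].append(new_paper)
        return best_index
    clusters.append([new_paper])
    return len(clusters) - 1

# Hypothetical result of a batch step: two author clusters.
clusters = [[{"a gupta", "jcdl"}, {"jcdl", "b lee"}], [{"c kim", "nips"}]]
print(assign_incrementally(clusters, {"jcdl", "d roy"}))   # joins the first cluster
print(assign_incrementally(clusters, {"e park", "cvpr"}))  # opens a new cluster
```

Only the new paper is compared against the existing clusters, so the cost of an update grows with the number of clusters rather than with the number of paper pairs.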


TABLE 6

SUMMARY OF UNSUPERVISED LEARNING METHODS

Reference | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation
Giles et al. [34] 2005 | Disambiguation in author citations | K-way spectral clustering | Co-author names, paper titles, and publication venue titles | Evaluation based on confusion matrix | DBLP | Spectral methods outperform k-means | Not compared with any state-of-the-art
Malin [35] 2005 | Name sharing and name variant problems | Hierarchical clustering and random walk | Actor lists for movies and television shows | Considered as baselines: 1) ambiguous names are distinct entities, 2) ambiguous names are a single entity | IMDB | Measuring similarity based on community, rather than exact similarity, is more robust | Not compared with any state-of-the-art
Bekkerman and McCallum [70] 2005 | Finding Web appearances of a group of people | Link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC) | Only affiliation of a person with a group is required | Traditional agglomerative clustering | Hand-labeled dataset of over 1000 Web pages | Improved F measure | Relational structure of relevant classes is not considered
Bhattacharya and Getoor [39] 2006 | Entity resolution | Probabilistic model, extended LDA | Decisions not on independent pairwise basis, but made collectively | Hybrid SoftTF-IDF [31] | Citeseer, arXiv (HEP) | Exploits collaborative group structure for making resolution decisions | Cannot resolve multiple entity classes
Bhattacharya and Getoor [95] 2007 | Entity resolution | Relational clustering algorithm | Attribute-based baselines | Attribute-based entity resolution, naïve relational entity resolution, collective relational entity resolution | Citeseer, arXiv, BioBase | Improved performance over baselines | Manually adjusted weighting parameter which can have different optimal values. Not scalable
Cota et al. [96] 2007 | Disambiguation in split citation and mixed citation | Heuristic-based hierarchical clustering | Authors, title of the work, publication venue | SVM, K-means | DBLP | Improved performance over baselines | Compared only with unsupervised methods
Song et al. [28] 2007 | Disambiguation exploiting contents of the articles | Two-stage approach based on LDA and PLSA | Person names within web pages and scientific documents | Spectral clustering and DBSCAN | Citeseer | Improved scalability | Compared only with unsupervised methods
Shin et al. [101] 2010 | Finding semantic relationships between authors and name sharing | Methods for namesake names and heteronymous names | Paper titles and topics | Comparison among two social networks with cosine similarity | DBLP | Improved effectiveness |
Yang and Wu [103] 2011 | Name sharing problem | Cosine, Modified Sigmoid Function, and Maximum Normalized Document Frequency | Triplet attributes along with web attribute | Compared with [34] | DBLP dataset constructed by [34] | Improved accuracy | Cluster separator filtered out some correctly matched pairs from the datasets
Tang et al. [29] 2012 | Disambiguation, how to find number of people "K" | Probabilistic framework | Attributes of publications and relationships | Four baseline methods | AMiner | Performs better than baseline and "K" is close to real |
Wu et al. [105] 2014 | Name disambiguation | DST-based unsupervised hierarchical agglomerative clustering | | Three unsupervised models | | Performance comparable to a supervised model |
Qian et al. [106] 2015 | Dynamic disambiguation | BatchAD+IncAD framework | Authors metadata | Five state-of-the-art batch AD methods | Two labeled data sets, case study and DBLP | Improved efficiency and accuracy | Erroneous results when an author changes affiliation or topic
Khabsa et al. [107] 2015 | Disambiguation with constraints | DBSCAN and pairwise distance based on random forests | Metadata information and citation similarity | Models with different combinations of features | Citeseer | Improved pairwise and cluster F1 | DBSCAN cannot split an impure cluster

C. SEMI-SUPERVISED METHODS

Semi-supervised learning approaches [58] have also been applied to AND in BD. They combine the characteristics of both supervised and unsupervised methods.

On et al. [53] proposed a framework for resolving the name variant problem in two steps: (1) blocking and (2) distance measurement. They used four blocking methods that reduce the number of candidates, and seven unsupervised distance measures that compute the distance between two candidate publications to decide whether they belong to the same entity. They also exploit two supervised algorithms, the naive Bayes model [88] and Support Vector Machines (SVMs) [89], to place the publications of each author in a separate cluster.

Lee et al. [37] refer to the name sharing problem as the mixed citation problem and to the name variant problem as the split citation problem. They used the naive Bayes model and SVM (supervised methods), and cosine, TFIDF, Jaccard, Jaro, and Jaro-Winkler (unsupervised methods), to resolve the name disambiguation problem.

On et al. [71] again focused on the name variant problem, calling it the Grouped-Entity Resolution (GER) problem. They propose Quasi-Clique, a graph partition-based method. Unlike previous text similarity approaches such as string distance, TFIDF, or the vector-based cosine metric, their approach investigates the hidden relationships among the grouped entities using the Quasi-Clique technique.

Huang et al. [109] resolve both types of problems on a small dataset selected from CiteSeer. They employ an online SVM algorithm (LASVM) as a supervised learner to find the distance metric over the publication attributes through pair-wise comparisons. With online learning, the supervised learner easily handles new papers. To cluster the publications of the authors, they use the DBSCAN algorithm, which constructs the clusters from multiple pair-wise similarities and handles the transitivity problem. They use different similarity metrics for different attributes, e.g., edit distance for URLs and emails, Jaccard similarity for affiliations and addresses, and Soft-TFIDF [110] for author names.
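Using a different metric per attribute can be sketched as follows. This is an illustrative sketch only: the normalization of the edit distance into a similarity, the whitespace tokenization, and the function and field names are assumptions, and Soft-TFIDF for author names is omitted.

```python
def edit_distance(s, t):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0

def token_jaccard(s, t):
    """Jaccard similarity over lowercased whitespace tokens."""
    a, b = set(s.lower().split()), set(t.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_profile(pub1, pub2):
    """One score per attribute, each computed with its own metric."""
    return {
        "email": edit_similarity(pub1["email"], pub2["email"]),
        "affiliation": token_jaccard(pub1["affiliation"], pub2["affiliation"]),
    }

# Hypothetical metadata of two author occurrences.
p1 = {"email": "jsmith@psu.edu", "affiliation": "Penn State University"}
p2 = {"email": "j.smith@psu.edu", "affiliation": "penn state"}
print(similarity_profile(p1, p2))
```

A profile of this kind is what a pair-wise learner would take as input before the clustering step.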

Zhang et al. [54] proposed a semi-supervised probabilistic model for name disambiguation with six constraints: (1-3) the triplet attribute constraints; (4) CoOrg, the principal authors of two papers are from the same organization; (5) citation, one publication cites the other; and (6) τ-CoAuthor, two co-authors (one from each publication) are not the same person but appear together as co-authors in another publication. They applied Hidden Markov Random Fields for AND on AMiner¹ data. Their model combines the six types of constraints with Euclidean distance and allows the user to refine the results.

Wang et al. [111] proposed a two-step semi-supervised method for AND that resolves the name sharing problem only for identical names in AMiner². They propose atomic clusters, i.e., clusters in which all publications belong to one particular author. In the first step, they use a bias classifier to find the atomic clusters. The input to the classifier is a list of publications bearing the ambiguous author name, together with the triplet attributes of these publications. In the second step, they integrate the atomic clustering results into the hierarchical and K-means clustering algorithms.

Wang et al. [52] proposed the constraint-based topic modeling (CbTM) method as an extension of [54]. They assume that if a pair of publications satisfies a constraint, the two publications are more likely to have similar topic distributions. They combine the original likelihood function of LDA with a set of constraints defined over the attributes available in the publication dataset, so the likelihood function is also affected by the constraints. The constraints are defined as a set of constraint functions, each taking the value 0 or 1: the function has the value 1 if the constraint holds for the pair of publications under consideration and 0 otherwise. They define five constraints. Two of them come from the triplet attributes, excluding the title attribute, and the other three are: indirect or transitive co-author (the τ-CoAuthor constraint defined in [54]); web constraint (the two publications appear on the same web page); and user feedback (what users comment about the authors of the two publications). In the end, an agglomerative hierarchical clustering algorithm is employed to construct clusters, each of which uniquely identifies an author and contains all of that author's publications.

1 http://AMiner.org
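Binary constraint functions of the kind described above can be sketched as follows. This is an illustration only: the function names and record fields are hypothetical, co-author sets are assumed to exclude the ambiguous author, and the web and user-feedback constraints are left out because they need external data. How the constraint values enter the likelihood is not shown.

```python
def co_venue(p1, p2):
    """1 if both publications appeared at the same venue, else 0."""
    return int(p1["venue"] == p2["venue"])

def co_author(p1, p2):
    """1 if the publications share at least one co-author, else 0."""
    return int(bool(p1["coauthors"] & p2["coauthors"]))

def transitive_co_author(p1, p2, all_pubs):
    """1 if a co-author of p1 and a different co-author of p2 appear
    together in some other publication, else 0."""
    for other in all_pubs:
        if other is p1 or other is p2:
            continue
        from_p1 = p1["coauthors"] & other["coauthors"]
        from_p2 = p2["coauthors"] & other["coauthors"]
        if any(x != y for x in from_p1 for y in from_p2):
            return 1
    return 0

def constraint_vector(p1, p2, all_pubs):
    """Values of all constraint functions for one pair of publications."""
    return [co_venue(p1, p2), co_author(p1, p2), transitive_co_author(p1, p2, all_pubs)]

# Hypothetical publications of one ambiguous author name.
pub1 = {"venue": "kdd", "coauthors": {"a gupta"}}
pub2 = {"venue": "kdd", "coauthors": {"b lee"}}
pub3 = {"venue": "icdm", "coauthors": {"a gupta", "b lee"}}
print(constraint_vector(pub1, pub2, [pub1, pub2, pub3]))
```

Here the first two publications share no co-author directly, but their co-authors appear together in the third publication, so the transitive constraint holds.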

Shu et al. [40] proposed the LDA-dual topic model for complete entity resolution. They categorize AND into three types: name sharing, name variant, and name mixing. They introduce the concept of global information based on the words and author names present in the dataset. In LDA-dual, topics are defined by two Dirichlet distributions, one over words and the other over author names, so that a topic is characterized as a series of words and author names. They also consider local information such as paper titles and co-authors. Along with the triplet attributes, they use topic similarity and minimum name distance. They argue that two publications share little local information compared with global information, and they employ Metropolis-Hastings within Gibbs sampling to calculate the global information, i.e., the model hyperparameters α, β, and γ. The complete process consists of the following steps: (1) find the topics of the publications in the dataset using Gibbs sampling; (2) construct a pair-wise classifier for two publications; (3) resolve the name sharing problem for each ambiguous author name with the help of spectral clustering and the classifier's support; (4) solve the name variant and name mixing problems with the help of the classifier.

Ferreira et al. [58] proposed Self-training Associative Name Disambiguation, a hybrid name disambiguation method. In the first (unsupervised) step, clusters of authorship records are formed using persistent patterns in the co-authorship graph. In the second (supervised) step, a subset of the clusters constructed in the first step is used as training data to derive the disambiguation function.

Arif et al. [112] proposed an enhanced version of the vector space model for AND in digital libraries. Along with the

normal authorship attributes, they added the additional information from the paper’s metadata, including email ID,

affiliation of authors, and co-authors as well. These additional features have greatly improved the performance of the

method. Table 7 shows the summary of name disambiguation methods that involve semi-supervised learning.

TABLE 7

SUMMARY OF SEMI-SUPERVISED LEARNING METHODS

Reference | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation

On et al. [53] 2005 | Name variant problem | (1) blocking and (2) distance measurement, 7 supervised and 2 unsupervised algorithms | Co-author relationships | Four alternatives using three representative metrics | DBLP, e-Print, biomed, econpapers | Using coauthor relation (instead of author name alone) shows improved scalability and accuracy | It is a two-step approach and shows improvement over one-step approach

Lee et al. [37] 2005 | Mixed citations and split citations | Sampling-based approximate join algorithm, 2 supervised and 5 unsupervised | Associated information of author names | Four alternatives using three representative metrics | DBLP, e-Print, biomed, econpapers | Improved accuracy | Accuracy for e-Print is lower as compared to DBLP’s accuracy

On et al. [71] 2006 | Name variant | Graph partition-based method Quasi-Clique | Contextual information mined from the group of elements | Quasi-Clique experimented on different real and synthetic datasets | ACM, biomed, IMDB | Improves precision and recall with existing ER solutions | Performance is better for IMDB but not for citations data, which has stronger connections as compared to actors in IMDB

Huang et al. [109] 2006 | Name sharing and name variant problem | LASVM and DBSCAN | Author and papers metadata | Traditional SVMs | Citeseer | Improved efficiency and effectiveness | -

Zhang et al. [54] 2007 | Name disambiguation | Semi-supervised probabilistic model | 6 different features from authors and citation information | Blocking and distance measure for co-authors | AMiner | Improved scalability and accuracy | Compared only with unsupervised hierarchical clustering methods

Wang et al. [111] 2008 | Name sharing problem | Two-step semi-supervised method | Atomic clusters with citations of a particular author | Hierarchical clustering and K-means | AMiner | Concept of atomic clusters produces better results. Co-author features are important for atomic clusters | Compared only with unsupervised hierarchical clustering methods

Shu et al. [40] 2009 | Name sharing, name variant and name mixing | LDA-dual topic model | Generative latent topic model that involves both author names and words | Experiments on three different training data sets | DBLP | Improved accuracy | Smoothing method for new words and author names does not scale

Ferreira et al. [58] 2010 | Name disambiguation | Self-training Associative Name Disambiguation (SAND) | Authorship records | Two supervised and two unsupervised methods | DBLP, bdbcomp | Improved results as compared to baselines | Improvement when compared with unsupervised methods as compared to the case of supervised methods

Wang et al. [52] 2010 | Name sharing problem | Constraint based topic modeling | Combine the original likelihood function of LDA with a set of constraints | Hierarchical clustering algorithm to group the papers into clusters | AMiner | Improved precision, recall and F1 | -

Arif et al. [112] 2014 | Mixed citation and split citations problem | Enhanced vector space model | Additional attributes like e-mail ID and affiliation of author and co-authors | Comparisons of real authors’ names with names generated by proposed method | IEEE | Improved F measure | Not tested against any baseline or state-of-the-art

B. GRAPH-BASED METHODS

Graph-based methods are popular for AND. Many authors employ a co-authorship graph to capture the similarity between two entities. Graph-based ideas have been adopted by many of the methods discussed above, such as relational similarity in Bhattacharya and Getoor [95] and Yin et al. [36]; inter-object connection strength in Kalashnikov and Mehrotra [113], Yin et al. [36], and Chen et al. [114]; and semantic association in Jin et al. [115]. The length of the shortest path in a graph is usually employed to estimate the degree of closeness between two nodes. Kalashnikov and Mehrotra [113] and Yin et al. [36] utilized connection strength to find the similarity of two nodes connected through relationships. For this purpose, Kalashnikov and Mehrotra [113] exploit legal paths and Fan et al. [43] make use of valid paths. Bhattacharya and Getoor [95] employed collaboration paths of length three and assigned equal weights to all paths regardless of their length. Kalashnikov and Mehrotra [113] proposed a more complicated method to calculate the weights for connection strengths, based on multiple equations and an iterative procedure. In contrast, On et al. [71] used Quasi-Clique, a graph mining technique [116], to take advantage of contextual similarity in addition to syntactic similarity. On et al. [71], Chen et al. [114], and Jin et al. [115] estimate the similarity between two nodes (authors) as a combination of the feature-based similarity and the connection strength of the graph. Chen et al. [114] estimate the connection strength between two nodes as the sum of the connection strengths of all simple paths no longer than a user-defined length.
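The idea of summing connection strengths over bounded-length simple paths can be illustrated with a minimal sketch. The per-path weight is not given in this survey, so the geometric `decay` used here is purely an assumption for illustration, as are the function names and the toy graph; it is not the formula of Chen et al. [114].

```python
def simple_paths(graph, source, target, max_len):
    """Yield every simple path (no repeated node) from source to target
    that uses at most max_len edges. graph maps node -> set of neighbours."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            yield path
            continue
        if len(path) - 1 >= max_len:      # path already has max_len edges
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:           # keep the path simple
                stack.append((nxt, path + [nxt]))


def connection_strength(graph, u, v, max_len=3, decay=0.5):
    """Sum of per-path strengths over all simple paths of at most max_len edges.
    Assumed weighting (hypothetical): a path with k edges contributes decay**(k-1)."""
    return sum(decay ** (len(p) - 2) for p in simple_paths(graph, u, v, max_len))


# Toy co-authorship graph: edges A-B, A-C, B-C, C-D
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(connection_strength(graph, "A", "C", max_len=2))  # direct edge + path via B
print(connection_strength(graph, "A", "D", max_len=3))  # paths A-C-D and A-B-C-D
```

Raising `max_len` admits longer, weaker paths, which is the trade-off a user-defined length bound is meant to control.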

The paragraph above gave a short comparative description of some of the graph-based works in AND. The details of each work are discussed next. McRae-Spencer and Shadbolt [117] addressed AND on large-scale citation networks through graph-based methods, exploiting self-citation, co-authorship, and publication source analyses in three passes to tie the papers of a particular author into a collection assigned to that author. The first pass tests each paper in the ambiguous name cluster against every other paper within that cluster to see whether the second paper is a self-citation of the first, or vice versa. The second pass draws a co-authorship graph, and the third pass uses source URL metadata. The output of these three passes is a graphical representation of the publications. The method is based on metadata rather than textual context and on the notion that authors cite their previous publications. Its reliance on self-citation is also a limitation, because new papers have few or no citations. The papers an author writes just before retirement¹ or death will never have self-citations. Similarly, papers written just before a change of research area will rarely be self-cited.
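The first (self-citation) pass can be sketched as follows. This is only an illustration under stated assumptions: the input format, the function name, and the choice to merge linked papers transitively with a union-find structure are not taken from [117].

```python
def self_citation_groups(cites):
    """Group papers of one ambiguous-name cluster by self-citation links.

    cites maps a paper id to the set of paper ids it cites. Two papers are
    linked when either one cites the other; links are merged transitively
    (an assumption of this sketch)."""
    parent = {p: p for p in cites}

    def find(p):
        # Follow parent pointers to the group representative, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for paper, cited in cites.items():
        for other in cited:
            if other in parent:               # ignore citations outside the cluster
                parent[find(paper)] = find(other)

    groups = {}
    for p in cites:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical cluster of four papers carrying the same author name
cites = {
    "p1": {"p2"},
    "p2": set(),
    "p3": {"p1", "x9"},   # x9 lies outside the ambiguous-name cluster
    "p4": set(),          # a new paper with no self-citation link
}
print(self_citation_groups(cites))
```

Paper `p4` stays in its own group, which mirrors the limitation noted above: papers with no self-citation links cannot be tied to an author by this pass alone.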

Galvez and Anegón [41] addressed the problem of conflating personal name variants into a standard or canonical form, exploiting finite-state transducers and binary matrices. They divide the variants into two categories: valid (variation between legitimate variants and canonical forms, e.g., the lack of some components of a full name, the absence or use of punctuation marks, and the use of initials) and non-valid (variation between non-legitimate variants and correct forms, e.g., misspellings involving deletions or insertions of characters in the strings, nicknames, abbreviations, and errors of accentuation in names from certain languages). They identify and conflate only valid variants into equivalence classes and canonical forms.

Yin et al. [36] proposed DISTINCT, an object distinction methodology to solve AND, where entities have identical

names. The method combines set resemblance of neighbor tuples and random walk probability (between two records

in the graph of relational data) to measure relational similarity between the records of the relational database. These

two methods are complementary: one exploits the neighborhood information of the two records, and the other uses

connection strength of linkages by assigning weights. DISTINCT exploits several types of linkages, like title, venue,

publisher, year, and author’s affiliation.

Jin et al. [115] proposed a graphical method, Semantic Association based Name Disambiguation (SAND). The similarity between the attributes (except co-authors) of two publications is measured through VSM, with TF-IDF applied for term weighting. For co-authors and transitive co-authors, semantic association graphs are constructed. The nodes represent authors and the edges represent the association between them. Each edge is weighted by the number of publications co-authored by the two authors. The method is a two-step process: RSAC (Related Semantic Association based Clustering) and SAM (Semantic Association based Merging). RSAC places two publications in one group if their co-authorship graphs are similar, i.e., they have common co-authors. In this way, all the publications are grouped into small clusters. The transitivity property may hold for the co-authors of some publications, but RSAC does not handle it, so the publications of one author may be split across multiple groups. To handle this issue, SAM merges the groups based on similarity values calculated for literature (titles + abstracts), affiliations, and transitive co-authorship graphs.
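The edge weighting and the common co-author test described above can be sketched in a few lines. The data layout, function names, and author names are hypothetical, and the sketch covers only these two building blocks, not the full RSAC and SAM procedure of [115].

```python
from collections import Counter
from itertools import combinations


def coauthor_edge_weights(publications):
    """Weight each co-authorship edge by the number of shared publications.

    publications maps a publication id to its list of author names."""
    weights = Counter()
    for authors in publications.values():
        # Sorting gives each unordered author pair one canonical key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def share_coauthor(authors_a, authors_b, ambiguous_name):
    """True if two publications have a co-author in common besides the ambiguous name."""
    others_a = set(authors_a) - {ambiguous_name}
    others_b = set(authors_b) - {ambiguous_name}
    return bool(others_a & others_b)


# Fictitious records for the ambiguous name "J. Smith"
pubs = {
    "p1": ["J. Smith", "A. Lee"],
    "p2": ["J. Smith", "A. Lee", "B. Chen"],
    "p3": ["J. Smith", "C. Diaz"],
}
weights = coauthor_edge_weights(pubs)
print(weights[("A. Lee", "J. Smith")])
print(share_coauthor(pubs["p1"], pubs["p2"], "J. Smith"))
print(share_coauthor(pubs["p1"], pubs["p3"], "J. Smith"))
```

In this toy data `p1` and `p2` would fall into one group while `p3` stays apart, which is the kind of split a later merging step based on titles, affiliations, and transitive co-authorship is meant to repair.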

Fan et al. [43] resolved the name sharing problem through GHOST (GrapHical framewOrk for name diSambiguaTion) using only the co-authorship attribute; for dense authors, however, they also exploited user feedback. Contrary to the methods of Chen et al. [114] and Jin et al. [115], GHOST does not take feature-based similarity into account, and the connection strength between nodes u and v is measured using an Ohm’s Law-like formula defined over a subset of valid paths. Another difference from the work in [115] is that GHOST does not model the transitive co-authorship graph. This work has two strengths. First, its time complexity is very low compared with previous works because it exploits only the co-author attribute, and it still achieves 94% precision on average. Second, GHOST employs the Ohm’s Law-like formula to compute the similarity between any pair of nodes in a co-authorship graph. The drawback of GHOST is that the results for dense authors are not in line with the results for non-dense authors. Fan et al. [43] proposed user feedback for such authors. This improves the results, but scalability becomes a challenge because real-life databases may contain thousands of dense authors.
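The exact formula of [43] is not reproduced in this survey, so the following is only a sketch of what an Ohm’s Law-like similarity could look like. It assumes that each valid path acts as a resistor whose resistance equals its length and that the paths are connected in parallel, so that conductances add up; both assumptions and the function name are illustrative.

```python
def parallel_path_similarity(path_lengths):
    """Similarity as the total conductance of paths treated as parallel resistors.

    Assumption of this sketch: a path of length L has resistance L, so it
    contributes conductance 1/L; parallel conductances are summed."""
    if not path_lengths:
        return 0.0            # no valid path: the two nodes are unrelated
    return sum(1.0 / length for length in path_lengths)


# Two nodes joined by valid paths of lengths 2, 2 and 4 (hypothetical values)
print(parallel_path_similarity([2, 2, 4]))
```

Under this analogy, many short paths between two nodes yield a high similarity, while a single long path yields a low one.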

Wang et al. [87] proposed ADANA, an active name disambiguation method that exploits a pair-wise factor graph (PFG) model, which can automatically determine the number of distinct names. Based on the PFG model, they introduce a disambiguation algorithm that improves performance through user interaction.

Shin et al. [118] proposed a graph-based model called Graph Framework for Author Disambiguation (GFAD), which uses co-author relations to construct graphs and removes ambiguity by splitting and merging vertices based on co-authorship. Table 8 provides a summary of methods that involve the use of graph-based models.

TABLE 8

SUMMARY OF GRAPH-BASED METHODS

Reference | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation

McRae-Spencer and Shadbolt [117] 2006 | Name disambiguation | Citation graph | Self-citation, co-authorship and document source analyses | Precision, recall and F1 for 8 name based clusters | Citeseer | Slightly improved results in terms of usefulness | Needs to create correction facility within some tested services

Galvez and Anegón [41] 2007 | Personal name variants problem | Standard or canonical form exploiting finite-state transducers and binary matrices | Author names | Application of master graph to the lists of author indexes | LISA, SCI-E | Improved precision, recall and F1, reduced erroneous analysis | Similarity measures need improvement in terms of error margins

Jin et al. [115] 2009 | Name disambiguation | Semantic Association based Name Disambiguation method (SAND) | Semantic association graphs | DISTINCT [36], AKTiveAuthor [117] | Citeseer, DBLP, Libra | Improved accuracy | -

Fan et al. [43] 2011 | Name disambiguation | Graphical framework for name disambiguation (GHOST) | Feature-based similarity, and the connection strength between nodes based on authorship | 2 labeled authors for DBLP and 8 labeled authors for PubMed for comparison, DISTINCT [36] | DBLP, PubMed | High precision and recall | Performance may suffer for rare dense authors

Wang et al. [87] 2011 | Active name disambiguation | ADANA using pair-wise factor graph | Active user interactions | 4 baseline methods | Publication data set, a web page data set, and a news page data set | Reduced error rate | Error rate has been decreased with the help of user corrections

Shin et al. [118] 2014 | Namesake problem | Graph Framework for Author Disambiguation | Co-author relations | 3 representative unsupervised methods | DBLP, AMiner | Improved performance | -

¹ By the term “retirement” we do not mean retirement from a job; rather, we mean retirement from research work, willingly or unwillingly, for any reason.

C. ONTOLOGY-BASED METHODS

In information science, an ontology is the knowledge of concepts and the relationships between those concepts within a domain. In other words, it is a knowledge representation of a domain. Ontology-based disambiguation has been exploited by many researchers in different fields, for example, Geographic Named Entity Disambiguation [119], Identity Resolution Framework (IdRF) [120], Named Entity Disambiguation exploiting Wikipedia [121], [122], and Entity Co-reference [92]. As far as digital libraries or BD are concerned, researchers have paid little attention to this kind of method.

Initially, Hassell et al. [123] resolved AND using an already populated ontology extracted from DBLP. They take a file from DBLP that contains entities such as authors, conferences, and journals, convert it into RDF, and use it as background knowledge. Their method takes a set of documents from DBWorld¹ posts (“call for papers”) and disambiguates the authors mentioned in them. Each such document contains multiple authors, say, the committee members, together with some information about them, such as affiliation, and information about the venue, such as its topics. The scenario of this method is different from those discussed throughout this article. All other approaches perform disambiguation either by predicting the most probable author of a publication or by grouping the publications of the same author in a unique cluster in BD. In contrast, this method pinpoints, with high accuracy, the correct author in the DBLP ontology file that a DBWorld document refers to. The method selects an author name from the document and searches for candidate authors in the populated ontology in RDF form. All the candidate authors are compared with the author in the document to predict the most confident author in the ontology that relates to the author in the document. Different types of relationships in the ontology are exploited to predict the correct author out of the various matches (candidates) in the ontology. These relationships include entity name, text proximity, text co-occurrence, popular entities, and semantic relationships. Entity name refers to specifying which entities from the populated ontology are to be spotted in the text of the document and later disambiguated, as not all the entities of the document may be present in the DBLP ontology. Text proximity is the number of space characters between the entity name and the known affiliation. Here, known affiliation means an object already known by the ontology as an affiliation, say, the name of a university. In DBWorld postings, affiliations are usually written next to the entity name. If an entity name and affiliation in the document match an author name and known affiliation in the ontology, the two are likely to be the same real-world entity. Text co-occurrence is utilized to match the research areas of the candidate authors in the ontology with the topics of the venue present in the posting. A popular entity is the author in the ontology that has the highest score of publications among the candidate authors. Semantic relationships are used to match the co-authors of the candidate authors in the ontology with the entities in the document, on the notion that the entities in a document may be related to one another in some way, for example as co-authors of some publications.
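The text proximity relationship, defined above as the number of space characters between the entity name and the known affiliation, can be sketched as a small function. The function name, the handling of missing matches, and the sample posting (with fictitious people and institutions) are assumptions of this sketch rather than details from [123].

```python
def text_proximity(text, entity_name, affiliation):
    """Count the space characters between an entity name and the next
    occurrence of a known affiliation. Returns None if either is missing."""
    start = text.find(entity_name)
    if start == -1:
        return None
    name_end = start + len(entity_name)
    aff_start = text.find(affiliation, name_end)   # search only after the name
    if aff_start == -1:
        return None
    return text[name_end:aff_start].count(" ")


# Fictitious call-for-papers fragment
posting = ("Program committee: Jane Doe, University of Somewhere; "
           "John Roe, Example Institute")
print(text_proximity(posting, "Jane Doe", "University of Somewhere"))
print(text_proximity(posting, "Jane Doe", "Example Institute"))
```

A small count suggests the affiliation is written right next to the name, which supports matching that name to the ontology author with the same known affiliation.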

Park and Kim [82] proposed the OnCU system to resolve the name sharing problem through ontology-based category utility. The term category utility refers to a similarity measurement between two entities. They exploit two types of ontology: an author ontology, built on the publications from the proceedings of several conferences, and a computer science domain ontology. Unlike Hassell et al. [123], they determine the correct author from the various candidate authors in the author ontology by exploiting the domain ontology to estimate semantic similarity. Their goal is to discover the right author of the input publication and his/her right homepage. Their method also differs from that of Hassell et al. [123] in using ontology-based evaluation functions. OnCU views candidate authors as clusters of their publications and employs a cluster-based evaluation function exploiting the ontology to predict the right author out of multiple candidate authors. Ontology-based approaches provide better semantic similarity measures for different attributes, but this is fruitful only if the ontologies providing background knowledge are carefully constructed and frequently revised to meet the dynamic nature of digital libraries. Table 9 provides a quick summary of the disambiguation methods that utilize ontologies.

¹ DBWorld. http://www.cs.wisc.edu/dbworld/ April 9, 2006

TABLE 9

SUMMARY OF ONTOLOGY-BASED METHODS

Reference | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation

Hassell et al. [123] 2006 | Entity disambiguation | Ontology-driven method | Background knowledge (authors, conferences, and journals) | Different types of relationships in the ontology are exploited | Ontology from DBLP, corpus from DBWorld | Successful use of large, populated ontology | Needs to be tested on robust platform

Park and Kim [82] 2008 | Name sharing problem | OnCU, ontology-based category utility | Author ontology, computer science domain ontology | Evaluation based on category utility over the created ambiguity dataset | Collected papers from AAAI, ISWC, ESWC, and WWW conference websites | Improved performance | Cannot consider property relations

VII. PERFORMANCE EVALUATION

Accuracy, precision, recall, and F-measure are the common performance metrics used to evaluate AND methods [29], [39], [40], [43], [52], [54], [70], [87], [101]. The performance of a method is measured in terms of either the number of publications correctly predicted or the number of authors correctly predicted. In the literature, these performance measures are defined in a variety of ways. Here we briefly describe the common notion of each term:

A. ACCURACY

Accuracy (disambiguation accuracy) is the generic term used to represent performance in terms of correctness. It may be defined in whatever way best suits the proposed method, and it may be equivalent to precision, recall, or F-measure. The term accuracy is defined and used by several researchers [37], [42], [51], [57]. For example, Han et al. [51] defined disambiguation accuracy as “the percentage of the query names correctly predicted”, whereas Han et al. [57] defined it as “the sum of diagonal elements divided by the total number of elements in the confusion matrix”. Both these definitions describe accuracy in terms of correctly predicted authors rather than the correctly predicted publications of an author.
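The confusion-matrix definition quoted from Han et al. [57] can be made concrete with a small sketch. The function name and the matrix values are made up for illustration; "total number of elements" is read here as the sum of all matrix entries.

```python
def confusion_accuracy(matrix):
    """Sum of the diagonal entries divided by the sum of all entries.

    matrix[i][j] counts items of true author i that were predicted as author j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical two-author confusion matrix
matrix = [
    [8, 2],   # true author 0: 8 predicted correctly, 2 assigned to author 1
    [1, 9],   # true author 1: 1 assigned to author 0, 9 predicted correctly
]
print(confusion_accuracy(matrix))  # 17 correct out of 20
```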

B. PRECISION

It is the ratio between the number of correctly predicted publications of author ai and the number of publications

predicted as ai’s publications.

$$\text{Precision} = \frac{|P \cap P'|}{|P'|} \tag{7}$$

where, 𝑃 = publications of author ai and 𝑃′ = publications predicted as author ai’s. Suppose author ai has publications

{P1-P5}; and the system predicted publications of author ai are {P1-P4, P6, P7}. By applying Eq. 7:

Precision = 4/6 = 0.67

C. RECALL

It is the ratio between the number of correctly predicted publications of author ai and number of ai’s publications.

$$\text{Recall} = \frac{|P \cap P'|}{|P|} \tag{8}$$

where, 𝑃 = Publications of author ai and 𝑃′ = Publications predicted as author ai’s. By considering the above example

using Eq. 8:

Recall = 4/5 = 0.8

D. F-MEASURE

It is the harmonic mean of precision and recall.

$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{9}$$

By considering the above example using Eq. 9:

$$\text{F-measure} = \frac{2}{\frac{1}{0.67} + \frac{1}{0.8}} = \frac{2}{(1.49 + 1.25)} = 0.73$$

The above metrics can also be defined at the cluster level [58]. Cluster precision is the fraction of the clusters produced by the method that are correct, cluster recall is the fraction of the true clusters that the method recovers, and cluster F-measure is the harmonic mean of the two [58].
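A cluster-level variant can be sketched in the same way. The sketch assumes that a produced cluster counts as correct only when it matches a true cluster exactly; this reading, along with the function name and toy data, is an assumption and may differ in detail from the definition used in [58].

```python
def cluster_metrics(true_clusters, predicted_clusters):
    """Cluster precision, recall and F-measure under exact-match correctness."""
    truth = {frozenset(c) for c in true_clusters}
    pred = {frozenset(c) for c in predicted_clusters}
    correct = len(truth & pred)                 # clusters reproduced exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical ground truth: three authors; the method splits the second one
true_clusters = [{1, 2, 3}, {4, 5}, {6}]
predicted_clusters = [{1, 2, 3}, {4}, {5}, {6}]
cp, cr, cf = cluster_metrics(true_clusters, predicted_clusters)
print(round(cp, 2), round(cr, 2), round(cf, 2))
```

Exact-match scoring is strict: a cluster that misses a single publication counts as wrong, so cluster-level scores are typically lower than publication-level ones.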

VIII. FUTURE DIRECTIONS AND RECOMMENDATIONS

Although a lot of research has been performed in this field, there is still considerable room for improvement. Many attempts have been made to assign a unique author ID to each author to resolve name ambiguity, but these methods could not gain the attention of researchers for the reasons discussed in Section 2. Many researchers emphasize exploiting more and more attributes to estimate the maximum similarity among the citations. This causes two issues: first, the time complexity of the algorithm increases and, as a result, scalability is adversely affected; second, it becomes almost impossible to obtain numerous features for every citation. Besides these issues, assigning a weight and fixing a threshold value for each feature becomes a bottleneck, especially when the feature set is large. We recommend exploiting only those features that are usually available in BD so that a general framework applicable to most of them can be proposed. To resolve the AND problem in a better way, we suggest a few directions below that may help improve performance:

1. Semantics play an important role in co-author networks [45]–[47]. WordNet¹ captures structured semantics of words and can be exploited for AND in BD to achieve more accurate results through ontologies [56], [97]. We also propose using multi-gram topic models, in addition to unigrams of words, for the distribution of topics over words. In this way, the natural syntactic relationship among the words is preserved and author writing habits can become useful for AND. These suggestions can be useful as they consider semantics and can provide better similarity estimation among the citations.

2. In the literature, the transitivity issue is addressed only for the co-authors attribute. We suggest leveraging this concept for the title and venue attributes too.

3. Instead of simply matching the titles of two publications, their reference lists can also be exploited to measure the similarity between them.

4. Most methods use only the title of the venue when handling the venue attribute. We suggest considering the ranking of the publication venues too. Based on this ranking, the REsearch Ability Level (REAL) of a researcher can be estimated. The REAL value may help predict the correct author, as authors with the same name might publish in venues of different rank. All these measurements help improve similarity metrics.

5. A change of research domain is common for authors these days due to overlaps between different fields. We suggest constructing sub-clusters within the cluster associated with a particular author. Each sub-cluster can differ from the others based on the multiple topics of interest of the author.

6. The advisor-advisee relationship can also be identified first to develop hierarchies of authors. As a result, authors who are not the same will become nodes of distinct branches of a tree.
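As an illustration of the third suggestion, the overlap between the reference lists of two publications could be measured with the Jaccard coefficient. The choice of Jaccard, the function name, and the reference identifiers are assumptions of this sketch; the suggestion itself does not prescribe a particular measure.

```python
def reference_overlap(refs_a, refs_b):
    """Jaccard coefficient between the reference lists of two publications."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0            # no references at all: no evidence either way
    return len(a & b) / len(a | b)


# Hypothetical reference identifiers of two publications
refs_p1 = ["r1", "r2", "r3"]
refs_p2 = ["r2", "r3", "r4", "r5"]
print(reference_overlap(refs_p1, refs_p2))  # 2 shared out of 5 distinct
```

Such a score could be combined with title, venue, and co-author similarities, in the same way other attribute similarities are combined in the methods surveyed above.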

IX. CONCLUSIONS

In this survey, we presented a detailed study of AND methods for BD. Key challenges are highlighted and a generic framework is proposed, which is intuitive and applicable. A lot of work has been done on the name variant and name sharing problems separately, but few efforts have been made to deal with both simultaneously, which needs more attention. Different types of methods, such as supervised, unsupervised, semi-supervised, graph-based, and ontology-based, provide elegant solutions for AND; still, graph-based and ontology-based methods need to be explored exhaustively. Finally, we highlighted the major issues and future directions in this field. These future directions and open challenges can give a quick start to future researchers who are interested in this area.

¹ http://wordnet.princeton.edu/

In this study, we presented a snapshot of the research on AND in BD, the methods applied, and the future challenges as of the time of writing. We believe that the fundamental information, methods, future directions, and open challenges presented here will help researchers in this area get a quick start, now and in the future.

ACKNOWLEDGEMENT

We are grateful to the Higher Education Commission (HEC) of Pakistan for their financial assistance to promote the

research trend in the country under the Indigenous 5000 Fellowship Program.

REFERENCES

[1] T. Amjad et al., “Standing on the shoulders of giants,” J. Informetr., vol. 11, no. 1, Art. no. 1, 2017.

[2] J. C. Chen, J. Z. Shyu, and C.-Y. Huang, “EVALUATING KNOWLEDGE DIFFUSION CAPABILITIES OF HIGHER EDUCATION

INSTITUTES BY USING THE DEA,” Glob. J. Res. Anal., vol. 6, no. 10, Art. no. 10, 2018.

[3] R. H. Gálvez, “Assessing author self-citation as a mechanism of relevant knowledge diffusion,” Scientometrics, vol. 111, no. 3, Art. no. 3,

2017.

[4] T. Amjad and A. Ali, “Uncovering diffusion trends in computer science and physics publications,” Libr. Hi Tech, vol. 37, no. 4, Art. no. 4,

2019.

[5] F. Haneef et al., “Using network science to understand the link between subjects and professions,” Comput. Hum. Behav., vol. 106, p.

106228, 2020.

[6] A. Daud et al., “Finding Rising Stars in Bibliometric Networks,” Scientometrics, pp. 1–20, 2020.

[7] T. Amjad, A. Daud, S. Khan, R. A. Abbasi, and F. Imran, “Prediction of Rising Stars from Pakistani Research Communities,” in 2018 14th

International Conference on Emerging Technologies (ICET), 2018, pp. 1–6.

[8] A. Daud, F. Abbas, T. Amjad, A. A. Alshdadi, and J. S. Alowibdi, “Finding rising stars through hot topics detection,” Future Gener.

Comput. Syst., vol. 115, pp. 798–813, 2021.

[9] T. Amjad, Y. Rehmat, A. Daud, and R. A. Abbasi, “Scientific impact of an author and role of self-citations,” Scientometrics, vol. 122, no.

2, pp. 915–932, 2020.

[10] X. Bai, I. Lee, Z. Ning, A. Tolba, and F. Xia, “The Role of Positive and Negative Citations in Scientific Evaluation,” IEEE Access, vol. 5,

pp. 17607–17617, 2017.

[11] A. Daud, T. Amjad, M. A. Siddiqui, N. R. Aljohani, R. A. Abbasi, and M. A. Aslam, “Correlational analysis of topic specificity and citations

count of publication venues,” Libr. Hi Tech, 2019.

[12] F. González-Sala, J. Osca-Lluch, and J. Haba-Osca, “Are journal and author self-citations a visibility strategy?,” Scientometrics, vol. 119,

no. 3, Art. no. 3, 2019.

[13] M. K. Hayat et al., “Towards Deep Learning Prospects: Insights for Social Media Analytics,” IEEE Access, vol. 7, pp. 36958–36979, 2019.

[14] D. Bunker, S. Stieglitz, C. Ehnis, and A. Sleigh, “Bright ICT: Social Media Analytics for Society and Crisis Management,” in International

Working Conference on Transfer and Diffusion of IT, 2019, pp. 536–552.

[15] Y.-C. Chang, C.-H. Ku, and C.-H. Chen, “Social media analytics: Extracting and visualizing Hilton hotel ratings and reviews from

TripAdvisor,” Int. J. Inf. Manag., vol. 48, pp. 263–279, 2019.

[16] Z. Saeed et al., “What’s happening around the world? A survey and framework on event detection techniques on twitter,” J. Grid Comput.,

vol. 17, no. 2, pp. 279–312, 2019.

[17] M. S. Faisal, A. Daud, A. U. Akram, R. A. Abbasi, N. R. Aljohani, and I. Mehmood, “Expert ranking techniques for online rated forums,”

Comput. Hum. Behav., vol. 100, pp. 168–176, 2019.

[18] N. Nikzad–Khasmakhi, M. A. Balafar, and M. R. Feizi–Derakhshi, “The state-of-the-art in expert recommendation systems,” Eng. Appl.

Artif. Intell., vol. 82, pp. 126–147, 2019.

[19] T. Amjad, A. Daud, and N. R. Aljohani, “Ranking authors in academic social networks: a survey,” Libr. Hi Tech, vol. 36, no. 1, Art. no. 1,

2018.

[20] T. Amjad, A. Daud, D. Che, and A. Akram, “MuICE: Mutual Influence and Citation Exclusivity Author Rank,” Inf. Process. Manag., 2015.

[21] T. Amjad, A. Daud, A. Akram, and F. Muhammed, “Impact of mutual influence while ranking authors in a co-authorship network,” Kuwait

J. Sci., vol. 43, no. 3, 2016, Accessed: Oct. 03, 2016. [Online]. Available: http://journals.ku.edu.kw/kjs/index.php/KJS/article/view/941

[22] T. Amjad and A. Daud, “Indexing of authors according to their domain of expertise,” Malays. J. Libr. Inf. Sci., vol. 22, no. 1, Art. no. 1,

2017.

[23] T. Amjad, Y. Ding, A. Daud, J. Xu, and V. Malic, “Topic-based heterogeneous rank,” Scientometrics, vol. 104, no. 1, Art. no. 1, 2015.

[24] T. Amjad, “Domain-Specific Scientific Impact and its Prediction,” in 2021 International Conference on Artificial Intelligence (ICAI), 2021,

pp. 16–21.

[25] C. Laorden, I. Santos, B. Sanz, G. Alvarez, and P. G. Bringas, “Word sense disambiguation for spam filtering,” Electron. Commer. Res.

Appl., vol. 11, no. 3, pp. 290–298, 2012.

[26] J. Miró-Borrás and P. Bernabeu-Soler, “Text entry in the e-commerce age: two proposals for the severely handicapped,” J. Theor. Appl.

Electron. Commer. Res., vol. 4, no. 1, pp. 101–112, 2009.

[27] D. Chen, X. Li, Y. Liang, and J. Zhang, “A semantic query approach to personalized e-Catalogs service system,” J. Theor. Appl. Electron.

Commer. Res., vol. 5, no. 3, pp. 39–54, 2010.

[28] Y. Song, J. Huang, I. G. Councill, J. Li, and C. L. Giles, “Efficient topic-based unsupervised name disambiguation,” in Proceedings of the

7th ACM/IEEE-CS joint conference on Digital libraries, 2007, pp. 342–351. Accessed: Oct. 07, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1255243

[29] J. Tang, A. C. Fong, B. Wang, and J. Zhang, “A unified probabilistic framework for name disambiguation in digital library,” IEEE Trans.

Knowl. Data Eng., vol. 24, no. 6, pp. 975–987, 2012.

[30] M. Ley, “The DBLP computer science bibliography: Evolution, research issues, perspectives,” in International Symposium on String

Processing and Information Retrieval, 2002, pp. 1–10. Accessed: Oct. 06, 2016. [Online]. Available:

http://link.springer.com/chapter/10.1007/3-540-45735-6_1

[31] C. L. Giles, K. D. Bollacker, and S. Lawrence, “CiteSeer: An automatic citation indexing system,” in Proceedings of the third ACM

conference on Digital libraries, 1998, pp. 89–98. Accessed: Oct. 06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=276685

[32] N. R. Smalheiser and V. I. Torvik, “Author name disambiguation,” Annu. Rev. Inf. Sci. Technol., vol. 43, no. 1, pp. 1–43, 2009.

[33] D. K. Sanyal, P. K. Bhowmick, and P. P. Das, “A review of author name disambiguation techniques for the PubMed bibliographic database,”

J. Inf. Sci., vol. 47, no. 2, pp. 227–254, 2021.

[34] H. Han, H. Zha, and C. L. Giles, “Name disambiguation in author citations using a k-way spectral clustering method,” in Proceedings of

the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’05), 2005, pp. 334–343. Accessed: Oct. 06, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4118563

[35] B. Malin, “Unsupervised name disambiguation via social network similarity,” in Workshop on link analysis, counterterrorism, and security,

2005, vol. 1401, pp. 93–102. Accessed: Oct. 06, 2016. [Online]. Available: http://www.siam.org/meetings/sdm05/sdm-link-analysis.zip#page=97

[36] X. Yin, J. Han, and P. S. Yu, “Object distinction: Distinguishing objects with identical names,” in 2007 IEEE 23rd International

Conference on Data Engineering, 2007, pp. 1242–1246. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4221773

[37] D. Lee, B.-W. On, J. Kang, and S. Park, “Effective and scalable solutions for mixed and split citation problems in digital libraries,” in

Proceedings of the 2nd international workshop on Information quality in information systems, 2005, pp. 69–76. Accessed: Apr. 12, 2016.

[Online]. Available: http://dl.acm.org/citation.cfm?id=1077514

[38] Y. F. Tan, M. Y. Kan, and D. Lee, “Search engine driven author disambiguation,” in Proceedings of the 6th ACM/IEEE-CS joint conference

on Digital libraries, 2006, pp. 314–315. Accessed: Apr. 12, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=1141826

[39] I. Bhattacharya and L. Getoor, “A Latent Dirichlet Model for Unsupervised Entity Resolution,” in SDM, 2006, vol. 5, p. 59. Accessed:

Oct. 06, 2016. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9781611972764.5

[40] L. Shu, B. Long, and W. Meng, “A latent topic model for complete entity resolution,” in Data Engineering, 2009. ICDE’09. IEEE 25th

International Conference on, 2009, pp. 880–891. Accessed: Apr. 12, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4812462

[41] C. Galvez and F. Moya-Anegón, “Approximate personal name-matching through finite-state graphs,” J. Am. Soc. Inf. Sci. Technol., vol.

58, no. 13, pp. 1960–1976, 2007.

[42] P. Treeratpituk and C. L. Giles, “Disambiguating authors in academic publications using random forests,” in Proceedings of the 9th

ACM/IEEE-CS joint conference on Digital libraries, 2009, pp. 39–48. Accessed: Oct. 07, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1555408

[43] X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, “On graph-based name disambiguation,” J. Data Inf. Qual. JDIQ, vol. 2, no. 2, p. 10, 2011.

[44] L. K. Branting, “A comparative evaluation of name-matching algorithms,” in Proceedings of the 9th international conference on Artificial

intelligence and law, 2003, pp. 224–232. Accessed: Apr. 12, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=1047837

[45] A. Daud, “Using time topic modeling for semantics-based dynamic research interest finding,” Knowl.-Based Syst., vol. 26, pp. 154–163,

2012.

[46] A. Daud and F. Muhammad, “Group topic modeling for academic knowledge discovery,” Appl. Intell., vol. 36, no. 4, 2012.

[47] A. Daud, J. Li, L. Zhou, and F. Muhammad, “Temporal expert finding through generalized time topic modeling,” Knowl.-Based Syst., vol. 23, no. 6, 2010.

[48] A. Daud, J. Li, L. Zhou, and F. Muhammad, “Knowledge discovery through directed probabilistic topic models: a survey,” Front. Comput.

Sci. China, vol. 4, no. 2, pp. 280–301, 2010.

[49] D. A. Dervos, N. Samaras, G. Evangelidis, J. Hyvärinen, and Y. Asmanidis, “The universal author identifier system (UAI_Sys),” 2006,

Accessed: Oct. 07, 2016. [Online]. Available: http://arizona.openrepository.com/arizona/handle/10150/105755

[50] A. M. Ketchum, “ORCID,” 2014, Accessed: Oct. 07, 2016. [Online]. Available:

http://nnlm.gov/sites/default/files/migrated/file/3f8237005268a231622eafdda40c9a49.pdf

[51] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis, “Two supervised learning approaches for name disambiguation in author citations,”

in Digital Libraries, 2004. Proceedings of the 2004 joint ACM/IEEE conference on, 2004, pp. 296–305. Accessed: Oct. 06, 2016. [Online].

Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1336139

[52] F. Wang, J. Tang, J. Li, and K. Wang, “A constraint-based topic modeling approach for name disambiguation,” Front. Comput. Sci. China,

vol. 4, no. 1, pp. 100–111, 2010.

[53] B.-W. On, D. Lee, J. Kang, and P. Mitra, “Comparative study of name disambiguation problem using a scalable blocking-based framework,”

in Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, 2005, pp. 344–353. Accessed: Oct. 06, 2016. [Online].

Available: http://dl.acm.org/citation.cfm?id=1065463

[54] D. Zhang, J. Tang, J. Li, and K. Wang, “A constraint-based probabilistic framework for name disambiguation,” in Proceedings of the

sixteenth ACM conference on information and knowledge management, 2007, pp. 1019–1022. Accessed: Oct. 07, 2016.

[Online]. Available: http://dl.acm.org/citation.cfm?id=1321600

[55] V. I. Torvik, M. Weeber, D. R. Swanson, and N. R. Smalheiser, “A probabilistic similarity metric for Medline records: A model for author

name disambiguation,” J. Am. Soc. Inf. Sci. Technol., vol. 56, no. 2, pp. 140–158, 2005.

[56] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[57] H. Han, W. Xu, H. Zha, and C. L. Giles, “A hierarchical naive Bayes mixture model for name disambiguation in author citations,” in

Proceedings of the 2005 ACM symposium on Applied computing, 2005, pp. 1065–1069. Accessed: Oct. 06, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1066920

[58] A. A. Ferreira, A. Veloso, M. A. Gonçalves, and A. H. Laender, “Effective self-training author name disambiguation in scholarly digital

libraries,” in Proceedings of the 10th annual joint conference on Digital libraries, 2010, pp. 39–48. Accessed: Oct. 07, 2016. [Online].

Available: http://dl.acm.org/citation.cfm?id=1816130

[59] B.-W. On and D. Lee, “Scalable Name Disambiguation using Multi-level Graph Partition,” in SDM, 2007, pp. 575–580. Accessed: Oct.

07, 2016. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9781611972771.64

[60] K.-H. Yang, H.-T. Peng, J.-Y. Jiang, H.-M. Lee, and J.-M. Ho, “Author name disambiguation for citations using topic and web correlation,”

in International Conference on Theory and Practice of Digital Libraries, 2008, pp. 185–196. Accessed: Oct. 07, 2016. [Online]. Available:

http://link.springer.com/chapter/10.1007/978-3-540-87599-4_19

[61] P. Reuther, “Personal name matching: New test collections and a social network based approach,” Comput. Sci. Tech. Rep., pp. 06–01,

2006.

[62] S. Pandit, S. Gupta, et al., “A comparative study on distance measuring approaches for clustering,” Int. J. Res. Comput. Sci., vol. 2,

no. 1, pp. 29–31, 2011.

[63] W. Cohen, P. Ravikumar, and S. Fienberg, “A comparison of string metrics for matching names and records,” in KDD Workshop on Data

Cleaning and Object Consolidation, 2003, vol. 3, pp. 73–78. Accessed: Oct. 07, 2016. [Online]. Available:

https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/kdd-2003-match-ws.pdf

[64] A. E. Monge, C. Elkan, et al., “The Field Matching Problem: Algorithms and Applications,” in KDD, 1996, pp. 267–270. Accessed:

Oct. 07, 2016. [Online]. Available: http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

[65] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids.

Cambridge university press, 1998.

[66] M. A. Jaro, “Probabilistic linkage of large public health data files,” Stat. Med., vol. 14, no. 5–7, pp. 491–498, 1995.

[67] W. E. Winkler, “The state of record linkage and current research problems,” 1999. Accessed: Oct. 07, 2016. [Online]. Available:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.4336

[68] Y. Chen and J. Martin, “Towards Robust Unsupervised Personal Name Disambiguation,” in EMNLP-CoNLL, 2007, pp. 190–198.

Accessed: Oct. 07, 2016. [Online]. Available:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.7132&rep=rep1&type=pdf#page=224

[69] L. Jin, C. Li, and S. Mehrotra, “Efficient record linkage in large data sets,” in Database Systems for Advanced Applications, 2003 (DASFAA 2003). Proceedings. Eighth International Conference on, 2003, pp. 137–146. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1192377

[70] R. Bekkerman and A. McCallum, “Disambiguating web appearances of people in a social network,” in Proceedings of the 14th international

conference on World Wide Web, 2005, pp. 463–470. Accessed: Oct. 06, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1060813

[71] B.-W. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei, “Improving grouped-entity resolution using quasi-cliques,” in Sixth International

Conference on Data Mining (ICDM’06), 2006, pp. 1008–1015. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4053144

[72] G. Salton, A. Wong, and C.-S. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.

[73] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence,

1999, pp. 289–296. Accessed: Oct. 07, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=2073829

[74] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Natl. Acad. Sci., vol. 101, no. suppl 1, pp. 5228–5235, 2004.

[75] M. A. Hernández and S. J. Stolfo, “The merge/purge problem for large databases,” in ACM Sigmod Record, 1995, vol. 24, pp. 127–138.

Accessed: Oct. 06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=223807

[76] H. L. Dunn, “Record linkage,” Am. J. Public Health Nations Health, vol. 36, no. 12, pp. 1412–1416, 1946.

[77] D. Bitton and D. J. DeWitt, “Duplicate record elimination in large data files,” ACM Trans. Database Syst. TODS, vol. 8, no. 2, pp. 255–265, 1983.

[78] K. J. Cios, R. W. Swiniarski, W. Pedrycz, and L. A. Kurgan, “The knowledge discovery process,” in Data Mining, 2007, pp. 9–24. Accessed:

Oct. 06, 2016. [Online]. Available: http://link.springer.com/content/pdf/10.1007/978-0-387-36795-8_2.pdf

[79] W. W. Cohen, H. Kautz, and D. McAllester, “Hardening soft information sources,” in Proceedings of the sixth ACM SIGKDD international

conference on Knowledge discovery and data mining, 2000, pp. 255–259. Accessed: Oct. 06, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=347141

[80] A. Bagga, Coreference, cross-document coreference, and information extraction methodologies. Duke University, 1998. Accessed: Oct.

06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=927251

[81] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser, “Identity uncertainty and citation matching,” in Advances in neural information

processing systems, 2002, pp. 1401–1408. Accessed: Oct. 06, 2016. [Online]. Available:

http://machinelearning.wustl.edu/mlpapers/paper_files/AP01.pdf

[82] Y.-T. Park and J.-M. Kim, “OnCU system: ontology-based category utility approach for author name disambiguation,” in Proceedings of

the 2nd international conference on Ubiquitous information management and communication, 2008, pp. 63–68. Accessed: Oct. 07, 2016.

[Online]. Available: http://dl.acm.org/citation.cfm?id=1352807

[83] C. L. Scoville, E. D. Johnson, and A. L. McConnell, “When A. Rose is not A. Rose: the vagaries of author searching,” Med. Ref. Serv. Q.,

vol. 22, no. 4, pp. 1–11, 2003.

[84] A. Culotta, P. Kanani, R. Hall, M. Wick, and A. McCallum, “Author disambiguation using error-driven machine learning with a ranking

loss function,” 2007. Accessed: Oct. 07, 2016. [Online]. Available: http://www.aaai.org/Papers/Workshops/2007/WS-07-14/WS07-14-006.pdf

[85] V. I. Torvik and N. R. Smalheiser, “Author name disambiguation in MEDLINE,” ACM Trans. Knowl. Discov. Data TKDD, vol. 3, no. 3,

p. 11, 2009.

[86] Y. Qian, Y. Hu, J. Cui, Q. Zheng, and Z. Nie, “Combining machine learning and human judgment in author disambiguation,” in Proceedings

of the 20th ACM international conference on Information and knowledge management, 2011, pp. 1241–1246. Accessed: Oct. 07, 2016.

[Online]. Available: http://dl.acm.org/citation.cfm?id=2063756

[87] X. Wang, J. Tang, H. Cheng, and P. S. Yu, “Adana: Active name disambiguation,” in 2011 IEEE 11th International Conference on

Data Mining, 2011, pp. 794–803. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6137284

[88] V. Vapnik, The nature of statistical learning theory. Springer Science & Business Media, 2013. Accessed: Oct. 06, 2016. [Online].

Available:

https://books.google.com.pk/books?hl=en&lr=&id=EqgACAAAQBAJ&oi=fnd&pg=PR7&dq=The+Nature+of+Statistical+Learning+Theory&ots=g2K2mycZ25&sig=ypjXo0ldi0UPQbOmCE-sfUJn4fk

[89] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge

university press, 2000. Accessed: Oct. 06, 2016. [Online]. Available:

https://books.google.com.pk/books?hl=en&lr=&id=_PXJn_cxv0AC&oi=fnd&pg=PR9&dq=An+Introduction+to+Support+Vector+Machines&ots=xRTi9F3u29&sig=qAt8_84DBLerD35-wG1RmkhA_18

[90] H. Han, H. Zha, and C. L. Giles, “A model-based k-means algorithm for name disambiguation,” 2003. Accessed: Oct. 07, 2016. [Online].

Available: http://ceur-ws.org/Vol-83/int_2.pdf

[91] X. Sun, J. Kaur, L. Possamai, and F. Menczer, “Detecting ambiguous author names in crowdsourced scholarly data,” in Privacy, Security,

Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International

Conference on, 2011, pp. 568–571. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6113170

[92] D. T. Hoang, J. Kaur, and F. Menczer, “Crowdsourcing scholarly data,” 2010, Accessed: Oct. 07, 2016. [Online]. Available:

http://journal.webscience.org/321/2/websci10_submission_107.pdf

[93] B. Zhang, M. Dundar, and M. A. Hasan, “Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using

Temporal Record Streams,” arXiv preprint arXiv:1607.05746, 2016. Accessed: Oct. 07, 2016. [Online]. Available:

http://arxiv.org/abs/1607.05746

[94] E. Elmacioglu, J. Kang, D. Lee, J. Pei, and B. On, “An effective approach to entity resolution problem using quasi-clique and its application

to digital libraries,” in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’06), 2006, pp. 51–52. Accessed:

Oct. 07, 2016. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4119096

[95] I. Bhattacharya and L. Getoor, “Collective entity resolution in relational data,” ACM Trans. Knowl. Discov. Data TKDD, vol. 1, no. 1, p.

5, 2007.

[96] R. G. Cota, M. A. Gonçalves, and A. H. Laender, “A Heuristic-based Hierarchical Clustering Method for Author Name Disambiguation in

Digital Libraries,” in SBBD, 2007, pp. 20–34. Accessed: Oct. 07, 2016. [Online]. Available:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5709&rep=rep1&type=pdf

[97] J. Soler, “Separating the articles of authors with the same name,” Scientometrics, vol. 72, no. 2, pp. 281–290, 2007.

[98] I.-S. Kang et al., “On co-authorship for author disambiguation,” Inf. Process. Manag., vol. 45, no. 1, pp. 84–97, 2009.

[99] D. A. Pereira, B. Ribeiro-Neto, N. Ziviani, A. H. Laender, M. A. Gonçalves, and A. A. Ferreira, “Using web information for author name

disambiguation,” in Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, 2009, pp. 49–58. Accessed: Oct. 07, 2016.

[Online]. Available: http://dl.acm.org/citation.cfm?id=1555409

[100] A. E. Gelfand, “Gibbs sampling,” J. Am. Stat. Assoc., vol. 95, no. 452, pp. 1300–1304, 2000.

[101] D. Shin, T. Kim, H. Jung, and J. Choi, “Automatic method for author name disambiguation using social networks,” in 2010 24th IEEE

International Conference on Advanced Information Networking and Applications, 2010, pp. 1263–1270. Accessed: Oct. 07, 2016. [Online].

Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5474861

[102] D. Shin, J. Kang, J. Choi, and J. Yang, “Detecting collaborative fields using social networks,” in Networked Computing and Advanced

Information Management, 2008. NCM’08. Fourth International Conference on, 2008, vol. 1, pp. 325–328. Accessed: Oct. 07, 2016.

[Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4624027

[103] K.-H. Yang and Y.-H. Wu, “Author name disambiguation in citations,” in Proceedings of the 2011 IEEE/WIC/ACM International

Conferences on Web Intelligence and Intelligent Agent Technology-Volume 03, 2011, pp. 335–338. Accessed: Oct. 07, 2016. [Online].

Available: http://dl.acm.org/citation.cfm?id=2052298

[104] H. Künsch, S. Geman, and A. Kehagias, “Hidden Markov random fields,” Ann. Appl. Probab., pp. 577–602, 1995.

[105] H. Wu, B. Li, Y. Pei, and J. He, “Unsupervised author disambiguation using Dempster–Shafer theory,” Scientometrics, vol. 101, no. 3, pp.

1955–1972, 2014.

[106] Y. Qian, Q. Zheng, T. Sakai, J. Ye, and J. Liu, “Dynamic author name disambiguation for growing digital libraries,” Inf. Retr. J., vol. 18,

no. 5, pp. 379–412, 2015.

[107] M. Khabsa, P. Treeratpituk, and C. L. Giles, “Online person name disambiguation with constraints,” in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015, pp. 37–46. Accessed: Oct. 08, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=2756915

[108] C.-C. Sun, D.-R. Shen, Y. Kou, T.-Z. Nie, and G. Yu, “Topological Features Based Entity Disambiguation,” J. Comput. Sci. Technol., vol.

31, no. 5, pp. 1053–1068, 2016.

[109] J. Huang, S. Ertekin, and C. L. Giles, “Efficient name disambiguation for large-scale databases,” in European Conference on Principles of

Data Mining and Knowledge Discovery, 2006, pp. 536–544. Accessed: Oct. 07, 2016. [Online]. Available:

http://link.springer.com/10.1007%2F11871637_53

[110] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg, “Adaptive name matching in information integration,” IEEE Intell.

Syst., vol. 18, no. 5, pp. 16–23, 2003.

[111] F. Wang, J. Li, J. Tang, J. Zhang, and K. Wang, “Name disambiguation using atomic clusters,” in Web-Age Information Management,

2008. WAIM’08. The Ninth International Conference on, 2008, pp. 357–364. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4597035

[112] T. Arif, R. Ali, and M. Asger, “Author name disambiguation using vector space model and hybrid similarity measures,” in Contemporary

Computing (IC3), 2014 Seventh International Conference on, 2014, pp. 135–140. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6897162

[113] D. V. Kalashnikov and S. Mehrotra, “Domain-independent data cleaning via analysis of entity-relationship graph,” ACM Trans. Database

Syst. TODS, vol. 31, no. 2, pp. 716–767, 2006.

[114] Z. Chen, D. V. Kalashnikov, and S. Mehrotra, “Adaptive graphical approach to entity resolution,” in Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 2007, pp. 204–213. Accessed: Oct. 07, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1255215

[115] H. Jin, L. Huang, and P. Yuan, “Name disambiguation using semantic association clustering,” in e-Business Engineering, 2009. ICEBE’09.

IEEE International Conference on, 2009, pp. 42–48. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5342132

[116] J. Pei, D. Jiang, and A. Zhang, “On mining cross-graph quasi-cliques,” in Proceedings of the eleventh ACM SIGKDD international

conference on Knowledge discovery in data mining, 2005, pp. 228–238. Accessed: Oct. 07, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1081898

[117] D. M. McRae-Spencer and N. R. Shadbolt, “Also by the same author: AKTiveAuthor, a citation graph approach to name disambiguation,”

in Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 2006, pp. 53–54. Accessed: Oct. 07, 2016. [Online].

Available: http://dl.acm.org/citation.cfm?id=1141762

[118] D. Shin, T. Kim, J. Choi, and J. Kim, “Author name disambiguation using a graph model with node splitting and merging based on

bibliographic information,” Scientometrics, vol. 100, no. 1, pp. 15–50, 2014.

[119] J. Kleb and R. Volz, “Ontology based entity disambiguation with natural language patterns,” in Digital Information Management, 2009.

ICDIM 2009. Fourth International Conference on, 2009, pp. 1–8. Accessed: Oct. 07, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5356769

[120] M. Yankova, H. Saggion, and H. Cunningham, “Adopting ontologies for multisource identity resolution,” in Proceedings of the first

international workshop on Ontology-supported business intelligence, 2008, p. 6. Accessed: Oct. 07, 2016. [Online]. Available:

http://dl.acm.org/citation.cfm?id=1452573

[121] H. T. Nguyen and T. H. Cao, “Named entity disambiguation on an ontology enriched by Wikipedia,” in Research, Innovation and Vision

for the Future, 2008. RIVF 2008. IEEE International Conference on, 2008, pp. 247–254. Accessed: Oct. 09, 2016. [Online]. Available:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4586363

[122] H. T. Nguyen and T. H. Cao, “Enriching ontologies for named entity disambiguation,” 2010. Accessed: Oct. 07, 2016. [Online]. Available:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.2198&rep=rep1&type=pdf

[123] J. Hassell, B. Aleman-Meza, and I. B. Arpinar, “Ontology-driven automatic entity disambiguation in unstructured text,” in International

Semantic Web Conference, 2006, pp. 44–57. Accessed: Oct. 07, 2016. [Online]. Available:

http://link.springer.com/10.1007%2F11926078_4
