Volume 2, Issue 1, Article 9, Pages 87-110, Dec, 2021 87
Received: 1 May 2020; Revised: 8 Dec 2021; Accepted: 10 Dec 2021; Published: 26 Dec 2021
Journal Name [Online ISSN Coming Soon], Volume 2, Issue 1, Article 9, Pages 87-110, December 2021
Digital Object Identifier 10.1111/RpJC.2020.DOINumber
Author Name Disambiguation in Bibliographic Databases: A Survey
Muhammad Shoaibᵃ, Ali Daudᵇ, Tehmina Amjadᶜ*
ᵃ Department of Computer Science, Comsats University, Sahiwal Campus, Pakistan
ᵇ Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Saudi Arabia
ᶜ Department of Computer Science and Software Engineering, International Islamic University, Islamabad, Pakistan
Corresponding author: Tehmina Amjad (tehminaamjad@iiu.edu.pk)
ABSTRACT Entity resolution has been a challenging and active research area in the field of Information Systems for the
last decade. Author name disambiguation in bibliographic databases like DBLP¹, Citeseer², and Scopus³ is a specialized
field of entity resolution. Given many citations of underlying authors, the author name disambiguation task is to find
which citations belong to the same author. In this survey, we start with three basic author name disambiguation
problems, followed by the need for solutions and the challenges involved. A generic, five-step framework is provided
for handling author name disambiguation issues. These steps are preparation of dataset, selection of publication
attributes, selection of similarity metrics, selection of models, and performance evaluation of clustering.
Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and
recommendations are given for this dynamic area of research.
Keywords Author Name Disambiguation, Bibliographic Databases, Entity Resolution, Metrics, Similarity.
I. INTRODUCTION
The scholarly societies that are constituted via bibliometric networks are growing with progress in scientific
research [1–4]. Network science methods cover several aspects of the study of evolving sciences, such as the
relationship between professions and their careers [5], finding emerging stars in scholarly networks [6–8], the
study of citation networks [9–12], social media analytics [13–16], and expert ranking methods [17–19]. The problem of
entity resolution has long attracted the attention of information systems researchers. Author Name
Disambiguation (AND) in Bibliographic Databases (BD) is an active topic and a specialized field of entity resolution.
Author name disambiguation is the process of distinguishing authors with similar names from each other.
Bibliographic databases include a large amount of data from co-author networks and digital libraries. Authors or
researchers can have similar names, can write their full names in multiple ways, or different authors can share
multiple names. These situations give rise to ambiguity for methods that need publication metadata for ranking
or evaluating authors [20–24]. Disambiguation methods are required not only in co-author networks but are
also significant in fields like spam filtering [25–27]. Search engines like Google⁴ help users search web
pages automatically. Name queries make up approximately 5-10% of all queries [28]. Further, it is estimated that the
300 most common male names are used by more than 114 million people in the United States [29]. Search engines
usually treat name queries as normal keyword searches and pay no special attention to their possible
ambiguity. For example, a Google search for Tehmina Amjad returns 228,000 web pages containing
similar names. Of these pages, only a small portion is relevant to the intended Tehmina Amjad. This is because
the data on the internet is heterogeneous.
1 http://www.informatik.uni-trier.de/~ley/db/
2 http://citeseer.ist.psu.edu/
3 http://www.scopus.com/home.url
4 http://www.google.com
In BD, it is necessary to distinguish the work of one researcher from that of another, and this process is known as AND.
Formally, a bibliographic database is an organized digital store of citations to research publications, patents, books,
and news articles. It stores the metadata of the publications. Examples of commonly used BD are DBLP [30], CiteSeer
[31], MEDLINE¹, and Google Scholar². An AND method that best fits one bibliographic dataset may not be suitable for
other datasets, because datasets differ in their metadata schema. Most methods fall under supervised learning,
unsupervised learning, or a combination of the two.
Smalheiser and Torvik [32] have provided a detailed literature survey of methods for AND, but their work has
shortcomings: a general framework is not provided, and similarity metrics and methods are not explained category-wise
in detail. Sanyal et al. [33] provide a comprehensive survey of the existing AND approaches that have been
applied to the PubMed database. The authors classify the approaches into a taxonomy and describe
the key characteristics of each approach, such as its performance, strengths, and weaknesses. They have also identified
the PubMed datasets that are publicly available for researchers to evaluate AND algorithms.
Our contributions in this work are as follows:
(1) Proposal of a general framework for AND;
(2) Categorization and elaboration of similarity metrics, which are the main focus of researchers in AND for
finding the resemblance among citations; and
(3) Categorization of methods used to handle the AND task into five types, with the elaboration of works falling
under each category in chronological order.
The rest of the paper is organized as follows. Section 2 describes AND tasks and related concepts. Section 3 provides
a general framework based on most of the methods used in the past. Section 4 is about the commonly used datasets to
perform AND. Section 5 is about the similarity estimation metrics. Section 6 categorizes the methods employed for
AND and explains the categories in chronological order. Section 7 explains how to compare different methods, and some
future directions and recommendations are suggested in Section 8. Finally, Section 9 concludes this paper.
II. AUTHOR NAME DISAMBIGUATION IN BIBLIOGRAPHIC DATABASES (ANDBD)
Resolving name ambiguity in Bibliographic Databases is called ANDBD. In the literature, many terms are used for this
problem, such as name disambiguation [34], [35], object distinction [36], mixed and split citation [37], author
disambiguation [38], and entity resolution [39], [40]. ANDBD problems can be divided into three categories. Before
discussing these categories through intuitive examples, we provide some related basic concepts.
Publication: A publication means the research work/article/paper of an author or group of authors working together
published at any venue (journal, conference, or workshop).
Citations: The number of times a publication is cited/referenced by other publications.
References: It is the list of references given at the end of a publication.
Ambiguous Author name(s): A name that is either shared by multiple authors or one of multiple variant names of a single
author. Let A be the ambiguous author name shared by k unique authors, say, a1, a2, …, ak. Further, let ai be
an author represented by m different names, say, n1, n2, …, nm. In this article, we use “ambiguous author
name”, “ambiguous author” and “ambiguous name” interchangeably.
A. PROBLEM CATEGORIES
1) SYNONYMY/NAME VARIANT PROBLEM
The problem of synonymy arises when an author's name appears with variations or abbreviations in the citations. For
example, the author name “Malik Sikandar Hayat Khiyal” is also written as “Sikandar Hayat” in citations of the
publications. DBLP treats them as two different authors and divides his publications between the two names. In the
literature, this problem is also referred to as the name variant problem [40], [41], entity resolution problem [39],
split citation problem [37], and aliasing problem [42].
2) POLYSEMY/NAME SHARING PROBLEM
The problem of polysemy arises when multiple authors share the same name label in multiple citations. For example,
“Guilin Chen” and “Guangyu Chen” write their names as “G. Chen” in their publications. The full name of an author
may also be shared by multiple authors. Bibliographic databases may treat these different authors as a single author.
As a result, querying the database for such an ambiguous name may list all publications under a single person’s
name. For example, querying DBLP for the author name “Michael Johnson” lists 32 publications that are actually from
five different people [40]. In the literature, this problem goes by various names, such as name disambiguation [34],
[35], [43], object distinction [36], mixed citation [37], author disambiguation [38], and the common name problem
[40].
1 www.ncbi.nlm.nih.gov
2 scholar.google.com
3) NAME MIXING PROBLEM
Shu et al. [40] introduced another type of name disambiguation problem and referred to it as the name mixing problem.
If multiple persons share multiple names, it is called the name mixing problem. The two problems discussed above
may occur simultaneously and cause the name mixing problem.
Typographical mistakes also cause name ambiguity. Treeratpituk and Giles [42] consider typographical mistakes
in names a separate name disambiguation problem. These problems may arise from the use of abbreviations,
spelling mistakes, and occasionally placing a caste or family name at the end or at the beginning of a name. L. Branting
[44] has discussed nine different types of name variations.
B. NEED FOR THE SOLUTION
Name ambiguity may cause incorrect authorship identification in literary works, resulting in improper credit attribution
to the authors. AND is a basic and necessary step for performing bibliometric and scientometric analyses.
Disambiguating authors helps establish precise author profiles, co-author networks, and citation networks. In
academic digital libraries, disambiguating author names is necessary for the following reasons.
• Users are interested in finding papers written by a particular researcher [45].
• Research communities and institutions can track the achievements of their researchers [46].
• It helps in expert finding, through which publishers can easily find paper reviewers [47].
C. CHALLENGES INVOLVED IN AND
Certain challenges are involved in AND, some of which are highlighted below.
• Lack of identifying information: The identifier metadata are either incomplete or not available at all.
• Multi-directional problem: Multi-disciplinary papers authored by multiple researchers from multiple institutions
(nationwide or worldwide) may cause a ‘multiple entities disambiguation’ problem.
• Small number of papers by most authors: The machine learning techniques used for AND give better results
when a reasonable number of examples are available. This is only possible when individual authors have
produced many papers. In MEDLINE, almost 46% of the authors have written only one paper [48]. Authors
with one or a few papers are a big hindrance to building precise machine learning techniques.
• Heterogeneous nature of BD: BD are heterogeneous in many ways, such as schema heterogeneity, discipline
heterogeneity, language heterogeneity, and attribute heterogeneity.
• Reluctance of authors: Sometimes authors are reluctant to register with a universal
identification system like UAI_Sys [49] or ORCID [50], or to maintain consolidated profiles.
• Economic issue: Constructing a database that can accommodate and manage the worldwide
research community, including all disciplines, nations, and languages, is not only economically infeasible
but also probably impossible.
• Ownership issue: While testing an algorithm for AND, confirming the original author sometimes becomes
doubtful.
D. IS A UNIQUE IDENTIFIER FOR AN AUTHOR A VIABLE SOLUTION?
One may think that unique identifiers, say, an Author Identification Number (AID), can be a simple and reliable solution
for this problem. Dervos et al. [49] proposed UAI_Sys, in which authors can register themselves by entering
their metadata information. UAI_Sys in return assigns a 16-digit unique code to the author. ORCID [50] is
a similar attempt for the same purpose; it issues a 16-character alphanumeric code to uniquely identify each researcher.
It offers a permanent identity for people, just like the ones issued for content-related entities on digital networks
by digital object identifiers. Although this seems feasible at first glance, the issues discussed in this
section are very difficult to address.
The project of Dervos et al. [49] expects authors to remember their passwords and UAIs. Researchers are unlikely
to remember such lengthy codes. Further, all the co-authors must also be registered with the
universal bibliographic database. A large number of authors may produce only 2 or 3 papers in their whole life. Such
casual researchers have little interest in registering in the database. Moreover, not only casual researchers but also
regular researchers (who produce a reasonable number of research papers) may provide wrong metadata information to
the system. Sometimes it is difficult to convince researchers to adopt new technologies. They
may resist giving up previous practices and adopting new ones.
If such a database is developed, ideally it should accommodate all the research areas, languages, states, and all types
of publications. Such a database does not seem economical, as it demands not only one-time expenses (development
cost) but also huge running expenses, including staff salaries, maintenance and security of the database, and handling
of user queries.
E. MATHEMATICAL NOTATIONS
Table 1 provides the mathematical notations used in this paper.
TABLE 1
MATHEMATICAL NOTATIONS
Symbol | Set | Description
A | A = {a_1, a_2, …, a_k}, where a_i is the ith author and k is the number of unique authors sharing an ambiguous name | Set of authors/persons sharing an ambiguous name
D | D = {d_1, d_2, …, d_d} | Set of documents in a dataset
P | P = {P_1, P_2, …, P_p} | Set of publications/documents associated with an ambiguous author/name
K | - | Number of clusters = number of unique authors associated with an ambiguous name
V | V = {v_1, v_2, …, v_v}, where v is the number of vertices | Set of vertices in a graph
E | E = {e_1, e_2, …, e_e}, where e is the number of edges | Set of edges in a graph
N | - | Number of unique authors
w | - | Set of words
t | - | Term; can be a word or a set of words
III. ANDBD PROCESS
In this section, we describe the general process of ANDBD. We do not follow the process exploited by any particular
research work. We provide the common steps involved in the ANDBD process. The purpose of this section is to help
readers comprehend the ANDBD task more easily and clearly. Figure 1 is the block diagram of the ANDBD process.
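Figure 1. ANDBD process (blocks: preparing the dataset, i.e., the list of papers of an ambiguous name; selecting paper attributes; similarity estimation between papers; methods; clusters, with the number of clusters equal to the number of unique authors; performance evaluation)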
A. PREPARING THE DATASET
For AND, a BD is used. The whole database is normally too large to analyze within a limited time. To avoid wasting
time on query processing in real-life databases, a small dataset is either selected from a functional BD or prepared
from scratch, normally by crawling the web pages of ambiguous authors. For example, Han et al. [51] exploit two datasets,
one for 15 different “J. Anderson”s, and the other for 11 unique “J. Smith”s, while Wang et al. [52] used a dataset
containing 16 ambiguous names comprising 241 unique authors. Preprocessing in name disambiguation usually
includes blocking, stop-word removal, and stemming [53]. Stop-word removal and stemming are required for
the title words of publications and venues. A blocking step is performed to group together the authors with ambiguous
names. Disambiguation operations are then performed within each ambiguous group to avoid useless comparisons and
operations involving non-ambiguous authors.
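The preprocessing steps above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the blocking key (first initial plus last name), the tiny stop-word list, and the helper names `blocking_key`, `title_tokens`, and `build_blocks` are hypothetical choices, not taken from any surveyed method, and stemming is omitted for brevity.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption); real systems use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "with"}


def blocking_key(author_name):
    """Map a name to '<first initial> <last name>' (assumed blocking scheme)."""
    parts = re.findall(r"[a-z]+", author_name.lower())
    if not parts:
        return ""
    return f"{parts[0][0]} {parts[-1]}"


def title_tokens(title):
    """Lower-case a title, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_blocks(records):
    """Group (author_name, title) records by blocking key."""
    blocks = defaultdict(list)
    for author_name, title in records:
        blocks[blocking_key(author_name)].append((author_name, title_tokens(title)))
    return dict(blocks)


records = [
    ("James H. Anderson", "Real-time scheduling on multicore platforms"),
    ("J. Anderson", "A theory of neural networks"),
    ("John Smith", "Indexing of the web"),
]
blocks = build_blocks(records)
```

Here both Anderson records fall into the same block, so later similarity computations only compare publications inside that block.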
B. SELECTING THE PUBLICATION ATTRIBUTES
It is desirable to utilize as many attributes of the publications as are available, though only the useful ones are
considered. Not all BD provide the same number and type of attributes, but three common attributes (co-authors,
publication title, and venue) are available in almost all of them. We call these three attributes the triplet attributes.
Most studies, like [51], use only the triplet attributes, while [40] exploits the triplet attributes plus topic similarity. Some
methods like [52], [54] take advantage of indirect co-authors, feedback, co-web, and publication year along with the
triplet attributes. Torvik et al. [55] propose eight different attributes: (1) middle initial, (2) suffix (e.g., Prof. or II), (3) full
name, (4) language, (5) number of common co-authors, (6) number of common title words, (7) number of common
affiliation words and (8) number of common Medical Subject Headings (MeSH) words. As more attributes are added,
the accuracy usually increases slightly at the cost of time complexity. In AND, time complexity is not a major
concern; however, the unavailability of a reasonable number of distinguishing attributes is a bottleneck.
C. SELECTING THE SIMILARITY ESTIMATORS
After the selection of available attributes, the most technical task is to select a proper similarity estimator for the
attributes. Almost all the methods in AND work on the notion that the higher the similarity among the attributes
of two citations, the more plausible it is that they belong to the same author. The focus of the proposed similarity
estimators is always to estimate the optimum similarity value among the attributes of the two papers. Researchers
exploit various similarity estimators for each type of attribute. For example, Shu et al. [40] used the edit distance
of two strings for the co-author attribute, the cosine similarity measure for the title and venue attributes, and the Latent
Dirichlet Allocation (LDA) [56] topic model for semantic topic similarity.
D. SELECTING THE MODELS
In this study, we categorize the AND methods into five types: (1) supervised learning, (2) unsupervised learning, (3)
semi-supervised learning, (4) graph-based, and (5) ontology-based. Supervised learning models perform classification,
unsupervised learning methods perform clustering, and semi-supervised models combine supervised
and unsupervised methods. Graph-based methods exploit links, and ontology-based methods exploit semantics-based
relationships between entities. The purpose of all methods is to separate the publications of each unique author into a
unique class/cluster. A large number of methods are available, so one must first decide which type of method
to employ, keeping the pros and cons of each alternative in mind. One can also devise a new method.
SVM, the C4.5 decision tree algorithm, and random forests are widely used classification
models in AND. On the other hand, spectral clustering and DBSCAN are popular clustering models.
E. MEASURING THE PERFORMANCE
The performance of the method used is measured using different performance metrics. Precision, recall, and F-measure
are very common performance metrics used for the evaluation of AND methods.
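As a hedged illustration of how these metrics can be applied to a clustering result, the following Python sketch computes pairwise precision, recall, and F-measure. The pairwise formulation (counting pairs of publications placed in the same cluster) is one common choice and is an assumption here; the function names and the toy labels are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(true_labels, predicted_labels):
    """Pairwise precision, recall, and F-measure of a predicted clustering."""
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(predicted_labels)
    correct = len(true_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Five publications: the first three belong to author a1, the last two to a2.
true_authors = ["a1", "a1", "a1", "a2", "a2"]
# A predicted clustering that wrongly moves the third publication to cluster 1.
predicted_clusters = [0, 0, 1, 1, 1]
precision, recall, f_measure = pairwise_scores(true_authors, predicted_clusters)
```

In this toy case two of the four predicted pairs are correct and two of the four true pairs are recovered, so all three scores equal 0.5.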
IV. DATASETS
Well-known BD like DBLP, MEDLINE, DBComp, Scopus, and CiteSeer have been widely utilized by
researchers for AND. DBLP is the most widely used database for this purpose. The main reason, perhaps, is that the
publication records in DBLP are represented in a well-structured format, i.e., XML. The basic issue faced by
researchers is how to measure the performance of a proposed method on standard/huge databases. For this purpose,
they pick a few ambiguous names from the database along with their publications and other discriminative attributes
and investigate the performance of their proposed method.
For example, Han et al. [51] exploited two types of datasets: (1) collected manually from the web by querying Google,
and (2) selected ambiguous names from DBLP. The first dataset consists of two ambiguous names, “J. Anderson” and
“J. Smith”. The “J. Anderson” part of the dataset consists of 15 unique authors who share the same name, and 229
publications; “J. Smith” is shared by 11 different authors whose total publications are 338. The “J. Anderson” part of
the first dataset is shown in Table 2. Tables 2, 3, and 4 show some examples of name ambiguity. We can see from Table
2 that there are 15 different people whose first name is James (or Jim) and whose last name is Anderson. However, most
of them have a different middle initial. All these names can appear in a publication as J. Anderson, and it must be
resolved which J. Anderson is actually intended. The second dataset consists of 9 ambiguous names, each having more
than 10 name variations, as shown in Table 3. These datasets were later used by many other works, like [34], [57].
Ferreira et al. [58] also used two datasets. They collected records from DBLP and DBComp. The statistics are given
in Table 4. Many other studies, like [34], [57], [59], [60], have used these datasets with some variations. Reuther [61]
investigated the existing test collections and proposed three new test collections to resolve the name variant problem.
TABLE 2
“J. ANDERSON” PART OF FIRST DATASET USED BY HAN ET AL. [51]
Full Name | Affiliation | No. of Pubs | Full Name | Affiliation | No. of Pubs
James Nicholas Anderson | UK Edinburgh | 8 | James D. Anderson | Univ. of Toronto | 5
James E. Anderson | Boston College | 14 | James P. Anderson | N/A | 3
James A. Anderson | Brown University | 3 | James M. Anderson | N/A | 5
James B. Anderson | Penn. State Univ | 6 | James Anderson | UK | 19
James B. Anderson | Univ. of Toronto | 21 | James W. Anderson | Univ. of KY | 10
James B. Anderson | Univ. of Florida | 17 | Jim Anderson | Univ. of Southampton | 20
James H. Anderson | Univ. of North Carolina | 54 | Jim V. Anderson | Virginia Tech Univ. | 40
James H. Anderson | Stanford Univ. | 4 | | |
TABLE 3
SECOND DATASET USED BY HAN ET AL. [51]
Ambiguous Name | Name Variations | No. of Pubs | Ambiguous Name | Name Variations | No. of Pubs
S Lee | 35 | 467 | C Lee | 18 | 152
J Lee | 33 | 330 | A Gupta | 16 | 332
J Kim | 25 | 239 | J Chen | 13 | 174
Y Chen | 24 | 201 | H Kim | 11 | 120
S Kim | 20 | 181 | | |
TABLE 4
DATASETS USED BY FERREIRA ET AL. [58]
DBLP
Ambiguous Name | No. of Authors | No. of Pubs
A. Gupta | 26 | 576
A. Kumar | 14 | 243
C. Chen | 60 | 798
D. Johnson | 15 | 368
J. Martin | 16 | 112
J. Robinson | 12 | 171
J. Smith | 29 | 921
K. Tanaka | 10 | 280
M. Brown | 13 | 153
M. Jones | 13 | 260
M. Miller | 12 | 405
DBComp
Ambiguous Name | No. of Authors | No. of Pubs
A. Oliveira | 16 | 52
A. Silva | 32 | 64
F. Silva | 20 | 26
J. Oliveira | 18 | 48
J. Silva | 17 | 36
J. Souza | 11 | 35
L. Silva | 18 | 33
Silva | 16 | 21
R. Santos | 16 | 20
R. Silva | 20 | 28
V. SIMILARITY METRICS
Selecting an appropriate similarity metric/distance function is a technical and challenging task [62] in AND. It is
advisable to employ the best-fitting similarity measure for each attribute of the publications. No single metric is the
best fit for all the attributes. Cohen et al. [63] compared different similarity metrics for name matching and concluded
that a combination of metrics provides better results than any single metric. Most similarity measures do not make
use of the semantics of the publications and use syntactic characteristics only, so we categorize these metrics into two
types: (1) syntactic and (2) semantic similarity metrics.
A. SYNTACTIC SIMILARITY METRICS
The similarity metrics that compare the strings themselves and do not account for synonymy and polysemy are syntactic
similarity metrics. The similarity of two publications can be obtained by cosine, Euclidean, Manhattan, Jaccard,
Jaro, Winkler, and TFIDF measures. These metrics often outperform Levenshtein-distance-based techniques [63]. Besides
these metrics, many other measures like typewriter distance, Jaro-Winkler, Monge-Elkan, or phonetic distances can also
be employed. The two most used subcategories of syntactic similarity metrics are (1) edit distance and (2) token-based
distance metrics.
1) EDIT DISTANCE METRICS
Distance functions map two strings S1 and S2 to a real number r, where a larger value of r indicates greater distance or
smaller similarity between S1 and S2. String distances are most useful for matching problems with little prior
knowledge and/or ill-structured data [63]. A variety of edit distance functions are used in text mining tasks. The edit
distance of two strings (names) is the minimum number of operations required to transform one string to the other.
These operations include insertion, deletion, and replacement of a character. A good comparison of name matching
techniques is given in [63].
The simplest is the Levenshtein distance [63], which assigns a unit cost to all edit operations. The Monge-Elkan distance
function [64] is more complex, is well-tuned with particular cost parameters, and is scaled to the interval (0, 1). It is
a variant of the Smith-Waterman distance function [65] and assigns a relatively lower cost to a sequence of insertions
or deletions.
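The unit-cost edit distance described above can be sketched in Python with the usual dynamic-programming recurrence. This is a minimal illustration; the function name `levenshtein` and the example names are chosen here for demonstration only.

```python
def levenshtein(s1, s2):
    """Minimum number of unit-cost insertions, deletions, and replacements
    needed to turn s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # replace c1 by c2 (or keep a match)
            ))
        previous = current
    return previous[-1]


# A single replaced character separates these two spellings.
distance = levenshtein("Sikandar Hayat", "Sikander Hayat")
```

For the two spellings above the distance is 1, so a small threshold on the edit distance would treat them as candidate variants of the same name.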
Shu et al. [40], Bhattacharya and Getoor [39], Torvik et al. [55], and Smalheiser and Torvik [32] utilized
edit-distance-like measures for measuring the name distance between the co-authors of two citations. Shu et al. [40]
applied a rule-based methodology along with edit distance.
A broadly similar metric that is not based on the edit distance model is the Jaro metric [66], which is based on the
number and order of the common characters between the two strings [37], [42], [53]. A variant of this function is
Jaro-Winkler [67], which also exploits the length of the longest common prefix of S1 and S2 [37], [42], [53], [68].
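The following Python sketch shows one common formulation of the Jaro and Jaro-Winkler similarities. The matching window, the prefix scaling factor of 0.1, and the prefix cap of 4 characters are the customary defaults and are assumptions here, not parameters stated in the surveyed works.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], based on common characters and their order."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they lie within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    seq1 = [ch for ch, m in zip(s1, matched1) if m]
    seq2 = [ch for ch, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


jaro_score = jaro("MARTHA", "MARHTA")
jaro_winkler_score = jaro_winkler("MARTHA", "MARHTA")
```

For the classic pair "MARTHA" and "MARHTA", the Jaro score is about 0.944 and the shared prefix "MAR" raises the Jaro-Winkler score to about 0.961.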
2) TOKEN BASED DISTANCE METRICS
Token-based distance metrics compare the words of the two strings S1 and S2 rather than their characters. Euclidean
distance is commonly used for text clustering problems and similarity estimation [28], [36], [54], [57], [69]. Let d1
and d2 represent the vectors of two documents; then the Euclidean distance between the two documents can be calculated
as:
$$DIST_E(\mathbf{d}_1,\mathbf{d}_2) = \left( \sum_{t=1}^{n} \left| w_{t,1} - w_{t,2} \right|^2 \right)^{1/2} \qquad (1)$$
where the term set is T = {t1, . . ., tn} and w_{t,1}, w_{t,2} are the weights (term frequencies) of term t in d1 and d2.
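Equation (1) can be illustrated with a short Python sketch that uses raw term frequencies as the weights. Whitespace tokenization and the helper names `term_vector` and `euclidean_distance` are simplifying assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a text, using whitespace tokenization."""
    return Counter(text.lower().split())


def euclidean_distance(d1, d2):
    """Euclidean distance between two term-frequency vectors, as in Eq. (1)."""
    terms = set(d1) | set(d2)
    # Counter returns 0 for terms that are absent from a document.
    return math.sqrt(sum((d1[t] - d2[t]) ** 2 for t in terms))


d1 = term_vector("author name disambiguation")
d2 = term_vector("author name matching")
distance = euclidean_distance(d1, d2)
```

The two titles differ in exactly two terms, each with frequency one, so the distance equals the square root of 2.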
Term Frequency Inverse Document Frequency (TFIDF) combines two factors: TF is the frequency of word w in an attribute
of a publication, and IDF is the inverse of the fraction of records in the dataset that contain w. It is used by [34],
[37], [42], [53], [70], [71]. Cohen et al. [63] considered a soft version of TFIDF in which similar tokens
are also considered along with the tokens in S1 ∩ S2. Most research works, like [37], [38], [40], [51]–[54], [58], use
the cosine similarity, which exploits TFIDF and the vector space model (VSM) [72]. Normally this function is used for
the title and venue attributes, although it can be used for any attribute represented in the form of vectors. The
documents are represented in vector space. Let d1 and d2 represent the vectors of two documents; then the cosine
similarity between the two documents can be calculated as:
$$SIM_C(\mathbf{d}_1,\mathbf{d}_2) = \cos\theta = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1| \, |\mathbf{d}_2|} \qquad (2)$$
The Jaccard coefficient, also called the Tanimoto coefficient, is the ratio between the intersection and the union of the
objects. It compares the sum weight of common terms to the sum weight of terms that are present in either of the two
documents except for the common terms [36], [37], [42], [53], [71]. Let d1 and d2 represent the vectors of two documents.
The Jaccard coefficient between the two documents is:
$$SIM_J(\mathbf{d}_1,\mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1|^2 + |\mathbf{d}_2|^2 - \mathbf{d}_1 \cdot \mathbf{d}_2} \qquad (3)$$
A document can also be considered as a probability distribution over terms. The similarity between
the two documents can then be calculated by measuring the distance between the two corresponding probability
distributions. Let d1 and d2 represent the vectors of two documents; the KL divergence between the two distributions of
words is calculated as:
$$D_{KL}(\mathbf{d}_1 \| \mathbf{d}_2) = \sum_{t=1}^{n} w_{t,1} \times \log\frac{w_{t,1}}{w_{t,2}} \qquad (4)$$
The KL divergence is not symmetric, whereas the average KL divergence is, which is why the average
KL divergence is more popular. The average weighted KL divergence from di to dj is the same as that from dj to di.
This average weighting between the two vectors of the corresponding documents guarantees symmetry. For text
documents, the average KL divergence between the two distributions of words is calculated as:
$$D_{AvgKL}(\mathbf{d}_1 \| \mathbf{d}_2) = \sum_{t=1}^{n} \left( \pi_1 \times D(w_{t,1} \| w_t) + \pi_2 \times D(w_{t,2} \| w_t) \right) \qquad (5)$$
where $\pi_1 = \frac{w_{t,1}}{w_{t,1} + w_{t,2}}$, $\pi_2 = \frac{w_{t,2}}{w_{t,1} + w_{t,2}}$, and $w_t = \pi_1 \times w_{t,1} + \pi_2 \times w_{t,2}$.
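The asymmetry of Eq. (4) and the symmetry of Eq. (5) can be checked numerically with the Python sketch below. It assumes that the per-term divergence D(x || y) is x log(x / y), in line with the summand of Eq. (4), and that documents are given as aligned lists of term probabilities; both the function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Eq. (4): sum over terms of p_t * log(p_t / q_t).
    Terms with p_t == 0 contribute nothing; q_t must be positive
    wherever p_t is positive."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)


def average_kl_divergence(p, q):
    """Eq. (5), with the per-term divergence taken as x * log(x / y)."""
    total = 0.0
    for pt, qt in zip(p, q):
        if pt + qt == 0:
            continue  # term absent from both documents
        pi1 = pt / (pt + qt)
        pi2 = qt / (pt + qt)
        wt = pi1 * pt + pi2 * qt
        if pt > 0:
            total += pi1 * pt * math.log(pt / wt)
        if qt > 0:
            total += pi2 * qt * math.log(qt / wt)
    return total


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
forward, backward = kl_divergence(p, q), kl_divergence(q, p)
avg_forward = average_kl_divergence(p, q)
avg_backward = average_kl_divergence(q, p)
```

For these distributions `forward` and `backward` differ, whereas `avg_forward` and `avg_backward` coincide, which matches the symmetry argument in the text.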
B. SEMANTIC SIMILARITY METRICS
The measures discussed above help in estimating pair-wise similarities between the corresponding attributes of the publications. They usually exploit syntactic characteristics and are unable to utilize the synonymy- and polysemy-based semantics of publications. Topic models such as PLSA [73] and LDA [56] provide excellent ways to exploit semantics. A publication usually covers multiple topics, and it is important to find the topic similarity between two publications. Generally, a topic is a semantically related probabilistic cluster of terms (words). Here, we describe LDA, which can capture semantics in an unsupervised way. It is a generative probabilistic model for text corpora [48], [56], [74] at the word and document levels. It treats every document as a mixture of topics and every topic as a distribution over the words in the vocabulary, with Dirichlet priors on these distributions. It has been used for finding topic similarity among publications [28], [39], [51]. Shu et al. [40] and Song et al. [28] extend the LDA model and apply it to AND. The probability of generating word w from document d is given as:
$$P(w \mid d, \theta, \Phi) = \sum_{z} P(w \mid z, \Phi_z)\, P(z \mid d, \theta_d) \qquad (6)$$
where w is a word in the vector form of d, z is a topic, and $\theta_d$ and $\Phi_z$ are multinomial distributions over topics (specific to d) and over words (specific to z), respectively.
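The sum over topics in Equation (6) can be illustrated with a small sketch. The two-topic model below, its vocabulary, and all probability values are hypothetical; in practice the distributions would be estimated from a corpus.

```python
def word_probability(word, theta_d, phi):
    """P(w | d) = sum over topics z of P(w | z) * P(z | d)."""
    return sum(p_z * phi_z.get(word, 0.0) for p_z, phi_z in zip(theta_d, phi))


# Hypothetical two-topic model: theta_d[z] is P(z | d), phi[z][w] is P(w | z).
theta_d = [0.7, 0.3]
phi = [
    {"clustering": 0.5, "graph": 0.3, "protein": 0.2},
    {"clustering": 0.1, "graph": 0.1, "protein": 0.8},
]

print(word_probability("protein", theta_d, phi))  # 0.7 * 0.2 + 0.3 * 0.8
```

Because each topic is a probability distribution over the vocabulary and the topic weights of the document sum to one, the resulting word probabilities also sum to one over the vocabulary.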
VI. APPROACHES FOR ANDBD
Much research has been done on entity resolution in a variety of research areas. In the field of databases, studies have addressed merge/purge [75], record linkage [76], duplicate record detection [77], data association [78], and database hardening [79]. In Natural Language Processing (NLP), Cross-Document Co-Reference [80] methodologies and name matching algorithms [44] have been designed. In BD, several methods or models are employed, such as citation matching [81], k-way spectral clustering [34], social network similarity [35], mixed and split citation [37], Latent Topic Model [40], latent Dirichlet model [39], Random Forests [42], Graph-based GHOST [43], and Ontology-based Category Utility [82].
A variety of solutions [32], [72], ranging from manual assignment by librarians [34], [83] to unsupervised learning, have been provided for AND. Most researchers categorize ANDBD methods into supervised, unsupervised, and semi-supervised learning methods. Graph-based and ontology-based methods have also been applied to resolve AND. We classify the methods for AND into the following five categories. Each category is explained in chronological order, with a discussion of its pros and cons.
A. SUPERVISED LEARNING METHODS
In supervised learning [42], [51], [55], [57], [84]–[86], the major objective is to find class labels by exploiting the related information. Supervised learning is labor-intensive and costly, and it becomes error-prone if the labeling or training of the dataset is not performed properly. Supervised methods achieve better performance than unsupervised methods, at the cost of expensive labeling effort and time. They may be used to predict an author's name in a citation [51] or to disambiguate the publications of a particular author [42], [55], [84], [85].
Han et al. [51] proposed two supervised methods to disambiguate author names in publications, using VSM [72], [87] to represent publications and cosine similarity to calculate the pair-wise similarity of publication attributes. They define canonical names by grouping together author names with the same first-name initial and the same last name. Each canonical name is associated with all the publications in which that name appears. The first method applies the naive Bayes probability model [88] and the second applies Support Vector Machines (SVMs) [89]. Both methods exploit the triplet1 attributes for similarity calculations. This well-known work enhances Han et al. [90], in which they used k-means clustering along with the naive Bayes model on the same dataset and attribute set.
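The canonical-name grouping described above can be sketched as follows. This is an illustrative assumption about how such grouping might be coded, not the implementation of [51]; the function names and the example names are hypothetical, and names are assumed to be written in "First [Middle] Last" order.

```python
from collections import defaultdict


def canonical_name(full_name):
    """Reduce a name to 'first initial + last name', e.g. 'Ana Silva' -> 'a silva'."""
    parts = full_name.replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}".lower()


def group_by_canonical_name(citations):
    """Associate each canonical name with the publications in which it appears."""
    groups = defaultdict(list)
    for publication_id, author_name in citations:
        groups[canonical_name(author_name)].append(publication_id)
    return dict(groups)


# Hypothetical (publication id, author name) pairs.
citations = [(1, "Ana Silva"), (2, "A. Silva"), (3, "Andre Silva"), (4, "Bruno Silva")]
print(group_by_canonical_name(citations))
```

Publications 1 to 3 fall under the same canonical name even though they may belong to different people, which is exactly the ambiguity the classifiers are then trained to resolve.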
Torvik et al. [55] proposed an authority control framework that resolves only the name-sharing problem for MEDLINE records, using eight different attributes. They calculate a pair-wise similarity profile based on these attributes and decide whether a pair of publications containing the same author name belongs to a single individual. Culotta et al. [84] proposed a method that overcomes the transitivity problem produced by pair-wise comparisons. A researcher can have multiple papers, email addresses, and affiliations, and a pair-wise classifier cannot handle multiple instances of an attribute when comparing the publications of such authors. They therefore compare sets rather than pairs and address the transitivity issue between co-authors in a better way: a new publication is compared with all the publications in a cluster rather than with one publication at a time. Comparing a publication with sets makes it possible to handle multiple values of an attribute.
1 In this article, we refer to the co-author, title, and venue attributes as triplet attributes.
Yin et al. [36] focused on the name-sharing problem by considering only identical names. They proposed DISTINCT, an object distinction methodology for disambiguating authors. It combines the set resemblance of neighbor tuples and the random walk probability between two records of a relational database. SVM [89] is applied to assign weights to the various types of links in the graph, and agglomerative hierarchical clustering produces the final clusters.
Torvik and Smalheiser [85] enhanced their earlier work [55] by (a) including the first name and its variants, emails, and correlations between last names and affiliation words; (b) employing new procedures for constructing huge training sets; (c) exploiting methods for calculating the prior probability; (d) correcting transitivity violations with a weighted least squares algorithm; and (e) using an agglomerative algorithm based on maximum likelihood to compute clusters of articles that represent authors. The method proposed in [55] was not scalable, which is a common problem of AND methods. These enhancements make it scalable to a huge dataset such as the MEDLINE records.
Treeratpituk and Giles [42] resolve the name-sharing problem in MEDLINE records. They introduce a Random Forest classifier to find a high-quality pair-wise linkage function. They define a similarity profile of 21 attributes, categorized into six types: three are the triplet attributes and the other three are affiliation similarity, concept similarity, and author similarity. They use a naive blocking procedure, which relies on the author's last name and first initial to block out pairs of author names that do not share both parts. They compare the results with SVM and show that Random Forests outperform SVM.
Qian et al. [86] proposed Labeling Oriented Author Disambiguation (LOAD) to resolve the author name disambiguation problem. LOAD uses supervised training to estimate the similarity between publications and builds High Precision Clusters (HPCs) for each author, changing the labeling granularity from individual publications to clusters. Labeling HPCs reduces the labeling effort by at least a factor of 10 compared with labeling publications. The HPCs found are then clustered into High Recall Clusters (HRCs) to place all publications of one author into the same cluster. For pair-wise comparisons, LOAD employs rich features such as the name, email, affiliation, and homepage of the two authors, co-author name, co-author email, co-author affiliation, co-author homepage, title bigram, reference, and download link. In addition, self-citation and the publishing-year interval between two papers are also considered.
The methods discussed above perform name disambiguation in an offline environment. In contrast, Sun et al. [91] proposed a publication analysis system that decides at query time, with the user's involvement, whether the queried author name matches the given set of publications retrieved from the Google Scholar database. The system exploits two kinds of heuristic features: (1) the number of publications per name variation and (2) publication topic consistency. Topic consistency exploits discipline tags crowd-sourced from the users of the Scholarometer system [92]. They train a binary classifier on a dataset of 500 top-ranked authors from the Scholarometer database1, manually labeled as either ambiguous or unambiguous, and examine the publications retrieved from Google Scholar for each queried name. To the best of our knowledge, this is the first work to address real-time author name disambiguation; it achieves 75% accuracy.
Zhang et al. [93] proposed a Bayesian non-exhaustive classification method for resolving online name disambiguation problems. They considered a case study on bibliographic data and used a temporal stream format, disambiguating authors by dividing their papers into similar groups. Table 5 provides a quick summary of the methods based on supervised learning models.
TABLE 5
SUMMARY OF SUPERVISED LEARNING METHODS
| Reference (year) | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Han et al. [51], 2004 | Disambiguate names in citations | Naive Bayes probability model, SVM | Co-author names, paper title, venue | Comparison of both approaches and their hybrid approach | Publications from web, DBLP | Hybrid of naive Bayes outperforms Hybrid I scheme of SVM | Not flexible, not topic sensitive |
| Torvik et al. [55], 2005 | Resolve name sharing | Authority control framework | 8 different attributes | Comparison is performed with manually labeled data only | MEDLINE | Different articles authored by the same individual will share similarity in one or more aspects of MEDLINE records | No comparison with state-of-the-art; specific to MEDLINE records only |
| Culotta et al. [84], 2007 | Transitivity due to pair-wise comparisons | Supervised machine learning, error-driven, rank-based training | Examining sets of records, not pairs | Approach is evaluated on three different datasets | Penn, Rexa, DBLP | Error reduction of 60% over standard binary classification approach | Not topic sensitive; not compared with state-of-the-art |
| Yin et al. [36], 2007 | Name sharing problem | Supervised and unsupervised set resemblance and random walk | Fusion of different types of subtle linkages | Comparison of both approaches and their hybrid approach | DBLP | Fusing different types of linkages and combining set resemblance of neighbor tuples and random walk probability is effective | Not compared with state-of-the-art; specific to authors with identical names only |
| Torvik and Smalheiser [85], 2009 | Enhancement of [55] | Estimating the probability that two articles sharing the same name were written by the same individual | Adding 5 more variants to [55] | [55] | MEDLINE | Author-ity model with more scalability and recall | Not high performance; model will fail to apply to scientists whose research output is diverse |
| Treeratpituk and Giles [42], 2009 | Name sharing problem | Random Forest classifier, naive blocking | 21 different attributes | SVM | MEDLINE | Random Forest classifier outperforms SVM | High accuracy can be achieved with a relatively small set of features |
| Qian et al. [86], 2011 | Labeling Oriented Author Disambiguation | Estimating similarity between publications using High Precision Clusters | Set of rich features | Human labeling after conventional automatic author disambiguation | CS, UE and DBLP | Machine learning combined with human judgement produces more accurate results to assist and reduce human labeling | No iterative process for AND; limited usage of feature sources; no usage of direct optimization algorithms |
| Sun et al. [91], 2011 | Detect ambiguous names at query time | Finding ambiguities from crowdsourced annotations | Number of citations per name variation, publication topic consistency | For each combination of features: accuracy, area under curve, and F1 | Papers retrieved from Google Scholar | Improved accuracy | Publication metadata was not considered |
| Zhang et al. [93], 2016 | Online name entity disambiguation | Dirichlet process prior with a Normal × Normal × Inverse Wishart data model | Temporal stream format | Qian's method [63], Khabsa's method [64] | AMiner | Proposed method outperforms the state-of-the-art methods | Computational complexity depends upon several factors and can be variable |
1 scholarometer.indiana.edu
B. UNSUPERVISED LEARNING METHODS
Unsupervised learning methods [28], [34], [35], [39], [59], [60], [70], [94]–[99] do not need manual labeling. Instead, they carefully choose features to group similar entities into clusters, and various clustering algorithms are applied for this purpose. For example, Giles et al. [34] apply a k-way spectral clustering method to resolve AND. Unsupervised methods save labeling time at the cost of some efficiency and precision. However, in many dynamic scenarios, unsupervised methods are a better solution than supervised ones.
The unsupervised methods may use similarities between publications, computed with a predefined set of similarity functions, to group the publications of a particular author. These functions are usually defined over the features present in the publications [34], [35], [59], [94]–[97]. Such features are also called local information [40] because they are explicitly available in the publication. The similarity functions may also be defined over implicit information such as the topics of the publication [36], [40], [60] or Web data [60], [98], [99]. Information about the topic(s) of a publication is not explicitly present in that publication; it is derived from the dataset and is therefore called global information [40].
Giles et al. [34] improved their previous work [51] by applying k-way spectral clustering to AND, using the triplet attributes to measure similarity. Malin [35] applied hierarchical clustering and random walks to resolve the name sharing and name variant problems. The main limitation of this method is the static threshold used as the stopping criterion of the clustering process. Bekkerman and McCallum [70] address the name ambiguity problem with two frameworks: the first uses the link structure of Web pages, and the second exploits A/CDC (Agglomerative/Conglomerative Double Clustering). Their methods require only minimal prior knowledge of the kind provided in BD. However, they are better suited to Web appearances than to BD.
Bhattacharya and Getoor [39] treated AND as an entity resolution problem and extended the LDA topic model [56]. They assume that authors who belong to one or more groups of authors may co-author papers, and they simultaneously discover the clusters of authors and the clusters of papers written by these authors. They perform parameter estimation through the Expectation-Maximization (EM) algorithm along with Gibbs sampling [100]. The extended model is about 100 times slower than an alternative method [95] and solves only the name variant problem. Bhattacharya and Getoor [95] proposed a collective entity resolution method as an improvement on their previous work [39]. Given two papers that are both written by authors a1 and a2, if the two instances of a2 refer to the same individual, then it is likely that both instances of a1 also refer to the same entity. Resolving this second-level ambiguity helps in cases where there is a high level of ambiguity. They treat high- and low-ambiguity scenarios separately, first addressing the most confident assignments and then the less confident ones. The final similarity value between two citations is calculated from pair-wise comparisons and from previously disambiguated authors. The weighting parameter is adjusted manually, and it may take different optimal values in different contexts. Although this method is an advance on their previous work [39], scalability remained a problem.
Cota et al. [96] proposed a heuristic-based hierarchical clustering method that successively combines clusters of citation records of the ambiguous authors. In the first step, the compatibility of the ambiguous author names is checked. If the two names in two publications are compatible, the publications are further compared for common compatible co-author(s). The two publications are merged into one cluster if a compatible co-author is found; otherwise they form separate clusters. The resulting clusters are almost pure but fragmented. To decrease the fragmentation, a second step compares clusters in a pair-wise fashion using the title and venue attributes. The major distinction of this method is that it compares all the titles and venues of a cluster with those of other clusters using a bag-of-words approach. If the similarity between two clusters reaches a threshold value, they are fused into one cluster; otherwise they remain separate. The authors claim improvements of up to 12% over non-hierarchical clustering, 21% over SVM, and 15.5% over K-means using the same attributes.
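The two-step procedure can be sketched roughly as follows. This is a simplified illustration rather than the method of [96]: it assumes that all records already carry compatible author names, it uses Jaccard similarity over pooled title and venue words as a stand-in for the bag-of-words comparison, and the threshold value, record layout, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bag_of_words(record):
    """Pool the title and venue words of one citation record."""
    return set(record["title"].lower().split()) | set(record["venue"].lower().split())


def heuristic_clustering(records, threshold=0.3):
    # Step 1: merge records that share at least one co-author (pure but fragmented clusters).
    clusters = []
    for record in records:
        merged = {
            "ids": [record["id"]],
            "coauthors": set(record["coauthors"]),
            "words": bag_of_words(record),
        }
        remaining = []
        for cluster in clusters:
            if cluster["coauthors"] & merged["coauthors"]:
                merged["ids"] += cluster["ids"]
                merged["coauthors"] |= cluster["coauthors"]
                merged["words"] |= cluster["words"]
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]

    # Step 2: fuse clusters whose pooled title and venue words are similar enough.
    fused = True
    while fused:
        fused = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i]["words"], clusters[j]["words"]) >= threshold:
                    absorbed = clusters.pop(j)
                    clusters[i]["ids"] += absorbed["ids"]
                    clusters[i]["coauthors"] |= absorbed["coauthors"]
                    clusters[i]["words"] |= absorbed["words"]
                    fused = True
                    break
            if fused:
                break
    return sorted(sorted(cluster["ids"]) for cluster in clusters)


# Hypothetical citation records that all carry the same ambiguous author name.
records = [
    {"id": 1, "coauthors": {"a. silva"}, "title": "web search ranking", "venue": "sigir"},
    {"id": 2, "coauthors": {"a. silva", "b. costa"}, "title": "query log mining", "venue": "sigir"},
    {"id": 3, "coauthors": {"c. lima"}, "title": "web search ranking models", "venue": "sigir"},
    {"id": 4, "coauthors": {"d. rocha"}, "title": "protein folding simulation", "venue": "bioinformatics"},
]
print(heuristic_clustering(records))
```

In this toy input, records 1 and 2 are joined by a shared co-author in the first step, record 3 is attached in the second step through its title and venue words, and record 4 stays in its own cluster.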
Song et al. [28] proposed an algorithm based on Probabilistic Latent Semantic Analysis [73] and Latent Dirichlet Allocation [56] that deals with AND by exploiting the contents of the articles. They used the metadata of publications and authors, together with each publication's first page, to relate authors to topics.
Shin et al. [101] proposed an AND framework that constructs a social network to find semantic relationships between authors and solves the name sharing and name variant problems simultaneously. They employ two methods: one for namesake names and the other for heteronymous names. The social network is constructed in three steps. (1) Information extraction: extraction of the paper title. (2) Candidate topic extraction: extraction of topics that are representative of the publication. These candidate topics are extracted from the abstract of the publication using morphemic analysis [102]. (3) Social network construction: the social network is constructed from the above two types of information. They used the cosine similarity metric to measure the similarity between two social networks.
Yang and Wu [103] resolve the name sharing problem by exploiting the triplet attributes along with a web attribute. They use cosine similarity and a Modified Sigmoid Function (MSF) for the triplet attributes, and Maximum Normalized Document Frequency (MNDF) for the web attribute, to estimate the pair-wise similarity between publications. They also employ a binary classifier to reduce the noise when clustering publications.
Tang et al. [29] formalize the name disambiguation problems in a unified probabilistic framework. The framework uses Markov Random Fields (MRF) [104], exploiting six local (publication) attributes (content-based information) and five relationships (structure-based information) between pairs of publications. The framework achieves better accuracy than the baselines, but its time cost is almost twice that of the baselines.
Wu et al. [105] used Dempster-Shafer theory (DST) for AND. They proposed an unsupervised DST-based hierarchical agglomerative clustering algorithm, combined with Shannon's entropy, that blends disambiguation attributes to select more reliable candidate pairs of clusters to merge in each iteration of clustering. Qian et al. [106] proposed a dynamic method for author name disambiguation that takes the growing nature of digital libraries into account. Their two-step process, BatchAD+IncAD, first performs AND by grouping all records into disjoint clusters and then periodically performs incremental AND for newly added papers, determining whether each new paper belongs to an existing cluster or forms a new one. Khabsa et al. [107] proposed a constraint-based clustering algorithm that allows both constraints and new data to be added to the clustering process incrementally. This methodology helps users by allowing them to make corrections to the disambiguated results. The method is based on a combination of DBSCAN and a pair-wise distance learned with random forests. Sun et al. [108] proposed an unsupervised AND solution based on topological features. To measure the similarity of publications, the method combines a structure similarity algorithm with a random walk with restarts. Table 6 summarizes the methods that apply unsupervised learning to AND.
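The incremental step described above (a new paper either joins an existing cluster or starts a new one) can be illustrated with a small sketch. It is not the BatchAD+IncAD model of [106]: the single-link co-author Jaccard score, the threshold value, and all names below are assumptions chosen only to make the decision rule concrete.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_incrementally(clusters, new_paper, threshold=0.25):
    """Add new_paper to the best-matching cluster, or open a new cluster.

    clusters is a list of non-empty lists of papers; returns the cluster index used.
    """
    best_index, best_score = None, threshold
    for index, cluster in enumerate(clusters):
        # Assumption: single-link score, i.e. the best match against any paper in the cluster.
        score = max(jaccard(new_paper["coauthors"], paper["coauthors"]) for paper in cluster)
        if score >= best_score:
            best_index, best_score = index, score
    if best_index is None:
        clusters.append([new_paper])
        return len(clusters) - 1
    clusters[best_index].append(new_paper)
    return best_index


# Hypothetical clusters produced by an earlier batch run.
clusters = [
    [{"id": 1, "coauthors": {"x", "y"}}],
    [{"id": 2, "coauthors": {"p", "q"}}],
]
print(assign_incrementally(clusters, {"id": 3, "coauthors": {"x", "z"}}))  # joins an existing cluster
print(assign_incrementally(clusters, {"id": 4, "coauthors": {"m"}}))       # opens a new cluster
```

Only the new paper is compared against the existing clusters, so earlier assignments are left untouched; this is what keeps the incremental step cheap compared with re-running the batch clustering.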
TABLE 6
SUMMARY OF UNSUPERVISED LEARNING METHODS
| Reference (year) | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Giles et al. [34], 2005 | Disambiguation in author citations | K-way spectral clustering | Co-author names, paper titles, and publication venue titles | Evaluation based on confusion matrix | DBLP | Spectral methods outperform k-means | Not compared with any state-of-the-art |
| Malin [35], 2005 | Name sharing and name variant problems | Hierarchical clustering and random walk | Actor lists for movies and television shows | Considered as baselines: 1) ambiguous names are distinct entities; 2) ambiguous names are a single entity | IMDB | Measuring similarity based on community, rather than exact similarity, is more robust | Not compared with any state-of-the-art |
| Bekkerman and McCallum [70], 2005 | Finding Web appearances of a group of people | Link structure of the Web pages; another using Agglomerative/Conglomerative Double Clustering (A/CDC) | Only affiliation of a person with a group is required | Traditional agglomerative clustering | Hand-labeled dataset of over 1000 Web pages | Improved F measure | Relational structure of relevant classes is not considered |
| Bhattacharya and Getoor [39], 2006 | Entity resolution | Probabilistic model, extended LDA | Decisions not on independent pairwise basis, but made collectively | Hybrid SoftTF-IDF [31] | CiteSeer, arXiv (HEP) | Exploits collaborative group structure for making resolution decisions | Cannot resolve multiple entity classes |
| Bhattacharya and Getoor [95], 2007 | Entity resolution | Relational clustering algorithm | Attribute-based baselines | Attribute-based entity resolution, naïve relational entity resolution, collective relational entity resolution | CiteSeer, arXiv, BioBase | Improved performance over baselines | Manually adjusted weighting parameter, which can have different optimal values; not scalable |
| Cota et al. [96], 2007 | Disambiguation in split citation and mixed citation | Heuristic-based hierarchical clustering | Authors, title of the work, publication venue | SVM, K-means | DBLP | Improved performance over baselines | Compared only with unsupervised methods |
| Song et al. [28], 2007 | Disambiguation exploiting contents of the articles | Two-stage approach based on LDA and PLSA | Person names within web pages and scientific documents | Spectral clustering and DBSCAN | CiteSeer | Improved scalability | Compared only with unsupervised methods |
| Shin et al. [101], 2010 | Finding semantic relationships between authors and name sharing | Methods for namesake names and heteronymous names | Paper titles and topics | Comparison among two social networks with cosine similarity | DBLP | Improved effectiveness | -- |
| Yang and Wu [103], 2011 | Name sharing problem | Cosine, Modified Sigmoid Function, and Maximum Normalized Document Frequency | Triplet attributes along with web attribute | Compared with [34] | DBLP dataset constructed by [34] | Improved accuracy | Cluster separator filtered out some correctly matched pairs from the datasets |
| Tang et al. [29], 2012 | Disambiguation, how to find number of people "K" | Probabilistic framework | Attributes of publications and relationships | Four baseline methods | AMiner | Performs better than baseline and "K" is close to real | -- |
| Wu et al. [105], 2014 | Name disambiguation | DST-based unsupervised hierarchical agglomerative clustering | | Three unsupervised models | | Performance comparable to a supervised model | -- |
| Qian et al. [106], 2015 | Dynamic disambiguation | BatchAD+IncAD framework | Authors metadata | Five state-of-the-art batch AD methods | Two labeled data sets, case study and DBLP | Improved efficiency and accuracy | Erroneous results when an author changes affiliation or topic |
| Khabsa et al. [107], 2015 | Disambiguation with constraints | DBSCAN and pairwise distance based on random forests | Metadata information and citation similarity | Models with different combinations of features | CiteSeer | Improved pairwise and cluster F1 | DBSCAN cannot split an impure cluster |
C. SEMI-SUPERVISED METHODS
Semi-supervised learning approaches [58] have also been applied to AND in BD. They combine the characteristics of both supervised and unsupervised methods.
On et al. [53] proposed a framework for resolving the name variant problem in two steps: (1) blocking and (2) distance measurement. They used four blocking methods to reduce the number of candidates and seven unsupervised distance measures to measure the distance between two candidate publications and decide whether they belong to the same entity. They also exploit two supervised algorithms, the naive Bayes model [88] and Support Vector Machines (SVMs) [89], to separate the publications of each author into a separate cluster.
Lee et al. [37] called the name sharing problem the mixed citation problem and the name variant problem the split citation problem. They used the naive Bayes model and SVM (supervised methods) and cosine, TFIDF, Jaccard, Jaro, and Jaro-Winkler (unsupervised methods) to resolve the name disambiguation problem.
On et al. [71] again focused on the name variant problem, calling it the Grouped-Entity Resolution (GER) problem. They propose Quasi-Clique, a graph partition-based method. Unlike previous text similarity approaches such as string distance, TFIDF, or the vector-based cosine metric, their approach investigates the hidden relationships among the grouped entities using the Quasi-Clique technique.
Huang et al. [109] resolve both types of problems on a small dataset selected from CiteSeer. They employed an online SVM algorithm (LASVM) as a supervised learner that finds the distance metric over publication attributes through pair-wise comparisons. With online learning, the supervised learner easily handles new papers. To cluster the publications of the authors, they used the DBSCAN algorithm, which constructs clusters from multiple pair-wise similarities and handles the transitivity problem. They use different similarity metrics for different attributes, e.g., edit distance for URLs and emails, Jaccard similarity for affiliations and addresses, and Soft-TFIDF [110] for author names.
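Two of the attribute-specific metrics mentioned above can be written in a few lines. The sketch below gives a standard Levenshtein edit distance and a token-level Jaccard similarity; the whitespace tokenization, lowercasing, and example strings are assumptions for illustration and are not taken from [109].

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def jaccard_similarity(a, b):
    """Jaccard similarity over lowercased whitespace tokens of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical attribute values.
print(edit_distance("jsmith@example.org", "j.smith@example.org"))
print(jaccard_similarity("Dept of Computer Science", "Computer Science Dept"))
```

Edit distance suits short strings such as emails and URLs, where small character-level differences matter, while the token-based Jaccard score ignores word order, which is helpful for affiliations written in different styles.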
Zhang et al. [54] proposed a semi-supervised probabilistic model for name disambiguation with six constraints: (1-3) the triplet attribute constraints; (4) CoOrg, the principal authors of two papers are from the same organization; (5) citation, one publication cites the other; and (6) τ-CoAuthor, two co-authors (one from each publication) are not the same person but appear as co-authors in another publication. They applied Hidden Markov Random Fields for AND on AMiner1 data. Their model combines the six types of constraints with Euclidean distance and allows the user to refine the results.
Wang et al. [111] proposed a two-step semi-supervised method for AND that resolves the name sharing problem only for identical names in AMiner. They introduce atomic clusters, in which each cluster contains publications of one particular author. In the first step, they use a bias classifier to find the atomic clusters. The input to the classifier is a list of publications carrying the ambiguous author name, together with the triplet attributes of those publications. In the second step, they integrate the atomic clustering results into the hierarchical and K-means clustering algorithms.
Wang et al. [52] proposed the constraint-based topic modeling (CbTM) method as an extension of [54]. They assume that if a pair of publications satisfies a constraint, then the two publications are more likely to have similar topic distributions. They combine the original likelihood function of LDA with a set of constraints defined over the attributes available in the publication dataset, so that the likelihood function is also affected by the constraints. The constraints are defined as a set of constraint functions, each taking the value 1 if the constraint is present in the pair of publications under consideration and 0 otherwise. They define five constraints: two belong to the triplet attributes (excluding the title attribute), and the other three are the indirect or transitive co-author constraint (the τ-CoAuthor constraint defined in [54]), the web constraint (the two publications appear on the same web page), and user feedback (what users say about the authors of the two publications). In the end, an agglomerative hierarchical clustering algorithm is employed to construct clusters that uniquely identify authors and contain all their publications.
1 http://AMiner.org
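The binary constraint functions can be made concrete with a short sketch. Only three of the five constraints are shown (co-author, venue, and indirect co-author); the record layout, the function names, and the set of known co-author pairs are assumptions for illustration and are not taken from [52].

```python
def co_author(p, q):
    """1 if the two publications share at least one co-author, else 0."""
    return int(bool(p["coauthors"] & q["coauthors"]))


def co_venue(p, q):
    """1 if the two publications appear in the same venue, else 0."""
    return int(p["venue"] == q["venue"])


def indirect_co_author(p, q, known_pairs):
    """1 if a co-author of p and a different co-author of q co-authored elsewhere, else 0."""
    for a in p["coauthors"]:
        for b in q["coauthors"]:
            if a != b and frozenset((a, b)) in known_pairs:
                return 1
    return 0


def constraint_vector(p, q, known_pairs):
    """Evaluate every constraint function on one pair of publications."""
    return [co_author(p, q), co_venue(p, q), indirect_co_author(p, q, known_pairs)]


# Hypothetical publications and co-author pairs observed in other publications.
p = {"coauthors": {"li", "chen"}, "venue": "KDD"}
q = {"coauthors": {"wu"}, "venue": "KDD"}
known_pairs = {frozenset(("chen", "wu"))}
print(constraint_vector(p, q, known_pairs))
```

Each entry of the resulting vector is either 0 or 1, matching the description of constraint functions above; how these values enter the likelihood is specific to the model and is not reproduced here.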
Shu et al. [40] proposed the LDA-dual topic model for complete entity resolution. They categorize AND into three types: name sharing, name variant, and name mixing. They introduce the concept of global information based on the words and author names present in the dataset. In LDA-dual, topics are defined through two Dirichlet distributions, one over words and the other over author names, so that a topic is characterized as a series of words and author names. They also consider local information such as paper titles and co-authors. Along with the triplet attributes, they use topic similarity and minimum name distance. They claim that two publications share little local information compared with global information, and they employ Metropolis-Hastings within Gibbs sampling to calculate the global information, i.e., the model hyperparameters α, β, and γ. The complete process consists of the following steps: (1) find the topics of the publications in the dataset using Gibbs sampling; (2) construct a pair-wise classifier over two publications; (3) resolve the name sharing problem with the help of spectral clustering and the classifier's support for each ambiguous author name; and (4) solve the name variant and name mixing problems with the help of the classifier.
Ferreira et al. [58] proposed Self-training Associative Name Disambiguation (SAND), a hybrid name disambiguation method. In the first (unsupervised) step, clusters of authorship records are formed using persistent patterns in the co-authorship graph. In the second (supervised) step, a subset of the clusters constructed in the first step is used as training data to derive the disambiguation function.
Arif et al. [112] proposed an enhanced version of the vector space model for AND in digital libraries. Along with the usual authorship attributes, they added information from the paper's metadata, including the email IDs and affiliations of the authors and co-authors. These additional features greatly improved the performance of the method. Table 7 summarizes the name disambiguation methods that involve semi-supervised learning.
TABLE 7
SUMMARY OF SEMI-SUPERVISED LEARNING METHODS
| Reference (year) | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| On et al. [53], 2005 | Name variant problem | (1) Blocking and (2) distance measurement; 2 supervised and 7 unsupervised algorithms | Co-author relationships | Four alternatives using three representative metrics | DBLP, e-Print, BioMed, EconPapers | Using co-author relation (instead of author name alone) shows improved scalability and accuracy | It is a two-step approach and shows improvement over one-step approach |
| Lee et al. [37], 2005 | Mixed citations and split citations | Sampling-based approximate join algorithm; 2 supervised and 5 unsupervised | Associated information of author names | Four alternatives using three representative metrics | DBLP, e-Print, BioMed, EconPapers | Improved accuracy | Accuracy for e-Print is lower than for DBLP |
| On et al. [71], 2006 | Name variant | Graph partition-based method Quasi-Clique | Contextual information mined from the group of elements | Quasi-Clique experimented on different real and synthetic datasets | ACM, BioMed, IMDB | Improves precision and recall with existing ER solutions | Performance is better for IMDB but not for citation data, which has stronger connections than actors in IMDB |
| Huang et al. [109], 2006 | Name sharing and name variant problems | LASVM and DBSCAN | Author and paper metadata | Traditional SVMs | CiteSeer | Improved efficiency and effectiveness | -- |
| Zhang et al. [54], 2007 | Name disambiguation | Semi-supervised probabilistic model | 6 different features from authors and citation information | Blocking and distance measure for co-authors | AMiner | Improved scalability and accuracy | Compared only with unsupervised hierarchical clustering methods |
| Wang et al. [111], 2008 | Name sharing problem | Two-step semi-supervised method | Atomic clusters with citations of a particular author | Hierarchical clustering and K-means | AMiner | Concept of atomic clusters produces better results; co-author features are important for atomic clusters | Compared only with unsupervised hierarchical clustering methods |
| Shu et al. [40], 2009 | Name sharing, name variant, and name mixing | LDA-dual topic model | Generative latent topic model that involves both author names and words | Experiments on three different training data sets | DBLP | Improved accuracy | Smoothing method for new words and author names does not scale |
| Ferreira et al. [58], 2010 | Name disambiguation | Self-training Associative Name Disambiguation (SAND) | Authorship records | Two supervised and two unsupervised methods | DBLP, BDBComp | Improved results as compared to baselines | Larger improvement over unsupervised methods than over supervised methods |
| Wang et al. [52], 2010 | Name sharing problem | Constraint-based topic modeling | Combine the original likelihood function of LDA with a set of constraints | Hierarchical clustering algorithm to group the papers into clusters | AMiner | Improved precision, recall, and F1 | -- |
| Arif et al. [112], 2014 | Mixed citation and split citation problems | Enhanced vector space model | Additional attributes like e-mail ID and affiliation of author and co-authors | Comparison of real author names with names generated by proposed method | IEEE | Improved F measure | Not tested against any baseline or state-of-the-art |
B. GRAPH-BASED METHODS
Graph-based methods are popular for AND. Many authors employ a co-authorship graph to capture the similarity
between two entities. This idea has been adopted by many of the methods discussed above, such as relational
similarity in Bhattacharya and Getoor [95] and Yin et al. [36]; inter-object connection strength in Kalashnikov and
Mehrotra [113], Yin et al. [36], and Chen et al. [114]; and semantic association in Jin et al. [115]. The length of the
shortest path in a graph is usually employed to estimate the degree of closeness between two nodes. Kalashnikov and
Mehrotra [113] and Yin et al. [36] utilized connection strength to find the similarity of two nodes connected through
relationships. For this purpose, Kalashnikov and Mehrotra [113] exploited legal paths, and Fan et al. [43] made use of
valid paths. Bhattacharya and Getoor [95] employed collaboration paths of length three and assigned equal weights to
all paths regardless of their length. Kalashnikov and Mehrotra [113] proposed a more complicated method to calculate
the weights for connection strengths, using multiple equations and an iterative procedure to determine the weights.
In contrast, On et al. [71] used Quasi-Clique, a graph mining technique [116], to take advantage of contextual
similarity in addition to syntactic similarity. On et al. [71], Chen et al. [114], and Jin et al. [115] estimated the
similarity between two nodes (authors) as a combination of feature-based similarity and the connection strength in
the graph. Chen et al. [114] estimated the connection strength between two nodes as the sum of the connection
strengths of all simple paths no longer than a user-defined length.
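The path-based notion of connection strength described above can be illustrated with a minimal Python sketch. It builds a co-authorship graph and sums a per-path strength over all simple paths up to a user-defined length. The per-path strength used here (`decay` raised to the number of edges) and all function names are assumptions made for illustration; the survey does not state the weighting used by Chen et al. [114].

```python
def build_coauthor_graph(papers):
    """Build an undirected graph: nodes are authors, edges join co-authors."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
            for b in authors:
                if a != b:
                    graph[a].add(b)
    return graph


def simple_paths(graph, source, target, max_len):
    """Yield every simple path (no repeated node) with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue  # path already uses the maximum number of edges
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def connection_strength(graph, u, v, max_len=3, decay=0.5):
    """Sum an assumed per-path strength, decay**edges, over bounded simple paths."""
    if u == v:
        return 0.0
    return sum(decay ** (len(p) - 1) for p in simple_paths(graph, u, v, max_len))


papers = [["A", "B"], ["B", "C"], ["A", "C"], ["C", "D"]]
graph = build_coauthor_graph(papers)
print(connection_strength(graph, "A", "D"))  # 0.375 (paths A-C-D and A-B-C-D)
```

Bounding the path length keeps the enumeration tractable on dense graphs, which is presumably why a user-defined limit is part of the definition.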
The paragraph above gave a short comparative description of some of the graph-based works in AND. The details of
each work are now discussed. McRae-Spencer and Shadbolt [117] resolved AND on large-scale citation networks
through graph-based methods that exploit self-citation, co-authorship, and publication source analyses in three
passes to tie the papers of a particular author to a collection assigned to that author. The first pass tests each
paper in the ambiguous name cluster against every other paper within that cluster to see whether the second paper
is a self-citation of the first, or vice versa. The second pass draws a co-authorship graph, and the third pass uses
source URL metadata. The output of these three passes is the graphical representation of the
publications. The method is based on metadata rather than textual context, and on the notion that authors cite their
previous publications. Its reliance on self-citation is a limitation, because new papers have few or no citations.
Papers written just before an author’s retirement1 or death will never be self-cited. Similarly, papers written just
before a change of research area will hardly ever be self-cited.
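The three-pass idea can be sketched with a union-find structure, as below. This is an illustrative approximation, not the implementation of McRae-Spencer and Shadbolt [117]: the record fields (`id`, `authors`, `cites`, `source`) and the linking rules are hypothetical, and a citation between two papers of the same ambiguous name cluster is treated as a self-citation.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups that contain a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def three_pass_grouping(papers, ambiguous_name):
    """Group the papers of one ambiguous name using three linking passes."""
    parent = {p["id"]: p["id"] for p in papers}

    def self_citation(p, q):  # pass 1: one paper cites the other
        return q["id"] in p["cites"] or p["id"] in q["cites"]

    def shared_coauthor(p, q):  # pass 2: a common co-author besides the name itself
        return bool((set(p["authors"]) & set(q["authors"])) - {ambiguous_name})

    def same_source(p, q):  # pass 3: identical source URL metadata
        return p["source"] is not None and p["source"] == q["source"]

    for linked in (self_citation, shared_coauthor, same_source):
        for i, p in enumerate(papers):
            for q in papers[i + 1:]:
                if linked(p, q):
                    union(parent, p["id"], q["id"])

    groups = {}
    for p in papers:
        groups.setdefault(find(parent, p["id"]), []).append(p["id"])
    return sorted(sorted(g) for g in groups.values())
```

A paper that has no citations, no shared co-authors, and no source metadata stays in a group of its own, which mirrors the limitation for new papers noted above.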
Galvez and Moya-Anegón [41] addressed the problem of conflating personal name variants into a standard or canonical
form by exploiting finite-state transducers and binary matrices. They divide the variants into valid and non-valid
categories. Valid variants are variations among legitimate variants and canonical forms, e.g., the lack of some
components of a full name, the absence or use of punctuation marks, and the use of initials. Non-valid variants are
variations among non-legitimate variants and correct forms, e.g., misspellings involving deletions or insertions of
characters in the strings, nicknames, abbreviations, and errors of accentuation in names from certain languages.
They identify and conflate only valid variants into equivalence classes and canonical forms.
Yin et al. [36] proposed DISTINCT, an object distinction methodology to solve AND, where entities have identical
names. The method combines set resemblance of neighbor tuples and random walk probability (between two records
in the graph of relational data) to measure relational similarity between the records of the relational database. These
two methods are complementary: one exploits the neighborhood information of the two records, and the other uses
connection strength of linkages by assigning weights. DISTINCT exploits several types of linkages, like title, venue,
publisher, year, and author’s affiliation.
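The two complementary measures can be sketched as follows. This is a simplified illustration rather than the DISTINCT algorithm itself: the random walk is limited to two steps (record to shared neighbor to record), the walk is symmetrized by averaging both directions, and the blending weight `alpha` is a hypothetical parameter.

```python
def set_resemblance(neigh_a, neigh_b):
    """Jaccard coefficient of two neighbor sets."""
    union = neigh_a | neigh_b
    return len(neigh_a & neigh_b) / len(union) if union else 0.0


def walk_probability(neighbors, start, end):
    """Probability of a two-step random walk: start -> shared neighbor -> end."""
    # Invert the mapping: for every neighbor value, which records link to it?
    records_of = {}
    for rec, neigh in neighbors.items():
        for n in neigh:
            records_of.setdefault(n, set()).add(rec)
    prob = 0.0
    for n in neighbors[start]:
        if end in records_of[n]:
            prob += (1 / len(neighbors[start])) * (1 / len(records_of[n]))
    return prob


def relational_similarity(neighbors, a, b, alpha=0.5):
    """Blend both measures; alpha and the symmetrized walk are assumptions."""
    resemblance = set_resemblance(neighbors[a], neighbors[b])
    walk = 0.5 * (walk_probability(neighbors, a, b) + walk_probability(neighbors, b, a))
    return alpha * resemblance + (1 - alpha) * walk
```

Here `neighbors` maps each record to the set of values it links to (for example venues or co-authors). Set resemblance rewards a large overlap, while the walk probability discounts links that pass through neighbors shared by many records.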
Jin et al. [115] proposed a graphical method, Semantic Association based Name Disambiguation. The similarity between
the attributes (except co-authors) of two publications is measured through VSM, with TF-IDF applied for term
weighting. For co-authors and transitive co-authors, semantic association graphs are constructed. The nodes represent
authors, and the edges represent associations; each edge is weighted by the number of publications co-authored by
the two authors. The method is a two-step process: RSAC (Related Semantic Association based Clustering) and SAM
(Semantic Association based Merging). RSAC clusters two publications in a group if the co-authorship graphs of the
two publications are similar, i.e., they have common co-authors. In this way, all the publications are grouped into
small clusters. The transitivity property may hold for the co-authors of some publications, but RSAC does not handle
it, so the publications of one author may be assigned to multiple groups. To handle this issue, SAM merges the groups
based on similarity values calculated for literature (titles + abstracts), affiliations, and transitive co-authorship graphs.
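The two building blocks named above, edge weights counted from joint publications and TF-IDF weighted vector space similarity, can be sketched in plain Python. The smoothed IDF variant and the whitespace tokenizer are assumptions for illustration; the survey does not specify the exact weighting used by Jin et al. [115].

```python
import math
from collections import Counter
from itertools import combinations


def coauthor_edge_weights(papers):
    """Weight each co-author edge by the number of jointly written papers."""
    weights = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def tfidf_vectors(docs):
    """TF-IDF vectors with a smoothed IDF (a common variant, assumed here)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

For example, `cosine(*tfidf_vectors([title_1, title_2]))` gives a title similarity in the range 0 to 1, and `coauthor_edge_weights` supplies the weights of the semantic association graph.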
Fan et al. [43] resolved the name sharing problem through GHOST (GrapHical framewOrk for name diSambiguaTion)
using only the co-authorship attribute; for dense authors, however, they also exploited user feedback. Contrary to
the methods of Chen et al. [114] and Jin et al. [115], GHOST does not take feature-based similarity into account, and
the connection strength between nodes u and v is measured using an Ohm’s Law-like formula defined over a subset of
valid paths. Another difference from the work in [115] is that GHOST does not model the transitive co-authorship
graph. This work has two strengths. First, its time complexity is very low compared to previous works because it
exploits only the co-author attribute, and it achieves 94% precision on average. Second, GHOST employs an Ohm’s
Law-like formula to compute the similarity between any pair of nodes in a co-authorship graph. The drawback of
GHOST is that the results for dense authors are not in line with the results for non-dense authors. Fan et al. [43]
proposed user feedback for such authors. This improves the results, but scalability becomes a challenge because
real-life databases may contain thousands of dense authors.
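The electrical analogy can be illustrated with a short sketch. It is only an interpretation of an "Ohm’s Law-like" formula, not the formula of Fan et al. [43]: each edge is assumed to be a unit resistor, edges along one path are combined in series, and the valid paths between two nodes are combined in parallel, so that more and shorter paths yield a higher similarity.

```python
def path_resistance(path):
    """Series rule: one unit resistor per edge (an assumed edge resistance)."""
    return len(path) - 1


def ohm_like_similarity(valid_paths):
    """Parallel rule: the conductances (1/R) of the individual paths add up."""
    return sum(1.0 / path_resistance(p) for p in valid_paths if len(p) > 1)


# Two valid paths between u and v, with two and three edges respectively.
paths = [["u", "a", "v"], ["u", "b", "c", "v"]]
print(ohm_like_similarity(paths))
```

Under these assumptions the similarity is the total conductance between the two nodes, and its reciprocal is the effective resistance.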
Wang et al. [87] proposed active user name disambiguation (ADANA), which exploits a pair-wise factor graph (PFG)
model that can automatically determine the number of distinct names. Based on the PFG model, they introduce a
disambiguation algorithm that improves performance through user interaction.
Shin et al. [118] proposed a graph-based model called Graph Framework for Author Disambiguation (GFAD), which
uses co-author relations to construct graphs and removes ambiguity by splitting and merging vertices based on
co-authorship. Table 8 provides a summary of methods that involve the use of graph-based models.
TABLE 8
SUMMARY OF GRAPH-BASED METHODS
| Reference # | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation |
|---|---|---|---|---|---|---|---|
| McRae-Spencer and Shadbolt [117] 2006 | Name disambiguation | Citation graph | Self-citation, co-authorship and document source analyses | Precision, recall and F1 for 8 name based clusters | Citeseer | Slightly improved results in terms of usefulness | Needs to create correction facility within some tested services |
| Galvez and Moya-Anegón [41] 2007 | Personal name variants problem | Standard or canonical form exploiting finite-state transducers and binary matrices | Author names | Application of master graph to the lists of author indexes | LISA, SCI-E | Improved precision, recall and F1, reduced erroneous analysis | Similarity measures need improvement in terms of error margins |
| Jin et al. [115] 2009 | Name disambiguation | Semantic Association based Name Disambiguation method (SAND) | Semantic association graphs | DISTINCT [36], AKTiveAuthor [117] | Citeseer, DBLP, Libra | Improved accuracy | -- |
| Fan et al. [43] 2011 | Name disambiguation | Graphical framework for name disambiguation (GHOST) | Feature-based similarity, and the connection strength between nodes based on co-authorship | 2 labeled authors for DBLP and 8 labeled authors for PubMed for comparison, DISTINCT [36] | DBLP, PubMed | High precision and recall | Performance may suffer for rare dense authors |
| Wang et al. [87] 2011 | Active name disambiguation | ADANA using pair-wise factor graph | Active user interactions | 4 baseline methods | Publication data set, a web page data set, and a news page data set | Reduced error rate | Error rate has been decreased with the help of user corrections |
| Shin et al. [118] 2014 | Namesake problem | Graph Framework for Author Disambiguation | Co-author relations | 3 representative unsupervised methods | DBLP, AMiner | Improved performance | -- |
1 By the term “retirement” we do not mean retirement from a job; rather, we mean retirement from research work,
willingly or unwillingly, for any reason.
C. ONTOLOGY-BASED METHODS
In information science, an ontology is the knowledge of concepts and the relationships between those concepts
within a domain. In other words, it is a knowledge representation of a domain. Ontology-based disambiguation has
been exploited by many researchers in different fields, for example, Geographic Named Entity Disambiguation [119],
the Identity Resolution Framework (IdRF) [120], Named Entity Disambiguation exploiting Wikipedia [121], [122], and
Entity Co-reference [92]. As far as digital libraries or BD are concerned, researchers have paid little attention to
this kind of method.
Initially, Hassell et al. [123] resolved AND through an already populated ontology extracted from DBLP. They take
a file from DBLP that contains entities such as authors, conferences, and journals, convert it into RDF, and use it
as background knowledge. Their method takes a set of documents from DBWorld1 posts, “call for papers”, to
disambiguate the authors. Each such document contains multiple authors, say, the committee members, and some
information about them, like affiliation, and information about the venue, like the topics of the venue. The scenario
of the method is different from those we have discussed throughout this article. All other approaches perform
disambiguation by either predicting the most probable author of a publication or by grouping the publications of the
same author in a unique cluster in BD. In contrast, this method pinpoints, with high accuracy, the correct author in
the DBLP ontology file that a DBWorld document refers to. The method selects an author name from the document and
searches for candidate authors in the populated ontology in RDF form. All the candidate authors are compared with
the author in the document to predict the most confident author in the ontology that relates to the author in the
document. Different types of relationships in the ontology are exploited to predict the correct author out of various
matches (candidates) in the ontology. These relationships include entity name, text proximity, text co-occurrence,
popular entities, and semantic relationships. Entity name refers to specifying which entities from the populated
ontology are to be spotted in the text of the document and later disambiguated, as not all the entities of the
document may be present in the DBLP ontology. Text proximity is the number of space characters between the entity
name and the known affiliation. Here, known affiliation means an object already known by the ontology as an
affiliation, say, the name of a university. In DBWorld postings, affiliations are usually written next to the entity
name. If an entity name and affiliation in the document match the author name and known affiliation in the ontology,
the two entities are likely to be the same real-world entity. Text co-occurrence is used to match the research areas
of the candidate authors in the ontology against the topics of the venue present in the posting. A popular entity is
the author in the ontology that has the highest score of publications among the candidate authors. Semantic
relationships are used to match the co-authors of the candidate authors in the ontology against the entities in the
document, on the notion that the entities in a document may be related to one another in some way, for example as
co-authors of some publications.
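The evidence types described above can be combined into a candidate score, as in the following sketch. It is a hypothetical illustration and not the scoring used by Hassell et al. [123]: the candidate fields, the proximity threshold `max_gap`, the normalization, and the weights are all assumptions.

```python
def spaces_between(text, name, affiliation):
    """Count space characters between a name and the next mention of an affiliation."""
    start = text.find(name)
    if start < 0:
        return None
    end = text.find(affiliation, start + len(name))
    if end < 0:
        return None
    return text[start + len(name):end].count(" ")


def score_candidate(cand, text, name, topics, entities, max_pubs, weights, max_gap=3):
    """Weighted sum of four evidence types for one candidate author."""
    gap = spaces_between(text, name, cand["affiliation"])
    evidence = {
        # text proximity: the known affiliation appears right after the name
        "proximity": 1.0 if gap is not None and gap <= max_gap else 0.0,
        # text co-occurrence: research areas that match the venue topics
        "cooccurrence": len(set(cand["areas"]) & set(topics)) / max(len(topics), 1),
        # popular entity: publication count relative to the most prolific candidate
        "popularity": cand["pub_count"] / max_pubs if max_pubs else 0.0,
        # semantic relationship: a co-author is also mentioned in the document
        "semantic": 1.0 if set(cand["coauthors"]) & set(entities) else 0.0,
    }
    return sum(weights[k] * v for k, v in evidence.items())


def best_candidate(candidates, text, name, topics, entities, weights):
    """Return the candidate with the highest combined score."""
    max_pubs = max(c["pub_count"] for c in candidates)
    return max(candidates, key=lambda c: score_candidate(
        c, text, name, topics, entities, max_pubs, weights))
```

Candidates are assumed to have been retrieved already by matching the entity name against the ontology, so name matching itself is not part of the score.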
Park and Kim [82] proposed the OnCU system to resolve the name sharing problem through ontology-based category
utility. The term category utility is used for similarity measurement between two entities. They exploit two types
of ontology: an author ontology, built on the publications from several conference proceedings, and a computer
science domain ontology. Unlike Hassell et al. [123], they determine the correct author from various candidate
authors in the author ontology by exploiting the domain ontology to estimate semantic similarity. Their goal is to
discover the right author of the input publication and his/her right homepage. Their method also differs from that of
Hassell et al. [123] in using ontology-based evaluation functions. OnCU views candidate authors as clusters of their
publications and employs a cluster-based evaluation function exploiting ontology to predict the right author out of
multiple candidate authors. Ontology-based approaches provide better semantic similarity measures for different
attributes, but this is fruitful only if the ontologies providing background knowledge are carefully constructed and
frequently revised to meet the dynamic nature of digital libraries. Table 9 provides a quick summary of
disambiguation methods that utilize a domain ontology.
1 DBWorld. http://www.cs.wisc.edu/dbworld/ April 9, 2006
TABLE 9
SUMMARY OF ONTOLOGY-BASED METHODS
| Reference # | Problem | Tool / Method | Attributes / features | Comparison with | Dataset | Finding | Limitation |
|---|---|---|---|---|---|---|---|
| Hassell et al. [123] 2006 | Entity disambiguation | Ontology-driven method | Background knowledge (authors, conferences, and journals) | Different types of relationships in the ontology are exploited | Ontology from DBLP, corpus from DBWorld | Successful use of large, populated ontology | Needs to be tested on more robust platforms |
| Park and Kim [82] 2008 | Name sharing problem | OnCU, ontology-based category utility | Author ontology, computer science domain ontology | Evaluation based on category utility over the created ambiguity dataset | Collected papers from AAAI, ISWC, ESWC, and WWW conference websites | Improved performance | Cannot consider property relations |
VII. PERFORMANCE EVALUATION
Accuracy, precision, recall, and F-measure are the common performance metrics used to evaluate AND methods [29],
[39], [40], [43], [52], [54], [70], [87], [101]. The performance of a method is measured in terms of either the
number of publications correctly predicted or the number of authors correctly predicted. In the literature, these
performance measures are defined in a variety of ways. Here we briefly describe the common notion of each term:
A. ACCURACY
Accuracy (disambiguation accuracy) is the generic term used to represent performance in terms of correctness. It may
be defined in whatever way best suits the proposed method, and it may be equivalent to precision, recall, or F-measure.
The term accuracy is defined and used by several researchers [37], [42], [51], [57]. For example, Han et al. [51]
defined disambiguation accuracy as “the percentage of the query names correctly predicted”, whereas Han et al. [57]
defined it as “the sum of diagonal elements divided by the total number of elements in the confusion matrix”. Both
these definitions describe the accuracy in terms of correctly predicted authors rather than the correctly predicted
publications of an author.
B. PRECISION
It is the ratio between the number of correctly predicted publications of author ai and the number of publications
predicted as ai’s publications.
\[ \mathrm{Precision} = \frac{|P \cap P'|}{|P'|} \tag{7} \]
where P = publications of author ai, P' = publications predicted as author ai’s, and |·| denotes the number of
publications in a set. Suppose author ai has publications {P1-P5}, and the system-predicted publications of author ai
are {P1-P4, P6, P7}. By applying Eq. 7:
Precision = 4/6 = 0.67
C. RECALL
It is the ratio between the number of correctly predicted publications of author ai and number of ai’s publications.
\[ \mathrm{Recall} = \frac{|P \cap P'|}{|P|} \tag{8} \]
where P = publications of author ai, P' = publications predicted as author ai’s, and |·| denotes the number of
publications in a set. By considering the above example using Eq. 8:
Recall = 4/5 = 0.8
D. F-MEASURE
It is the harmonic mean of precision and recall.
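\[ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{9} \]
By considering the above example using Eq. 9, with n = 2, x_1 = Precision, and x_2 = Recall:
\[ \text{F-measure} = \frac{2}{\frac{1}{0.67} + \frac{1}{0.8}} = \frac{2 \times 0.67 \times 0.8}{0.67 + 0.8} = 0.73 \]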
The above metrics can also be defined at the cluster level [58]. Cluster precision is the fraction of the clusters
produced by the method that are correct, cluster recall is the fraction of the true clusters that the method recovers
correctly, and cluster F-measure is the harmonic mean of the two [58].
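A cluster-level version can be sketched in the same way. In this sketch a predicted cluster counts as correct only if it matches a true cluster exactly, which is one common reading of the definition and is stated here as an assumption.

```python
def cluster_metrics(true_clusters, predicted_clusters):
    """Cluster precision, recall, and F-measure based on exact cluster matches."""
    truth = {frozenset(c) for c in true_clusters}
    predicted = {frozenset(c) for c in predicted_clusters}
    correct = len(truth & predicted)  # predicted clusters identical to a true cluster
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For instance, if the true clusters are {1, 2, 3} and {4, 5} and the method returns {1, 2, 3}, {4}, and {5}, only one of three predicted clusters is correct and one of two true clusters is recovered.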
VIII. FUTURE DIRECTIONS AND RECOMMENDATIONS
Although a lot of research has been performed in this field, there is still considerable room for improvement. Many
attempts have been made to assign a unique author ID to each author to resolve name ambiguity, but these methods
could not gain the attention of researchers for the reasons discussed in Section 2. Many researchers emphasize
exploiting more and more attributes to estimate the maximum similarity among the citations. This causes two issues:
first, the time complexity of the algorithm increases, and scalability is adversely affected as a result; second, it
becomes almost impossible to have numerous features available for each citation. Besides these issues, assigning
weights and fixing threshold values for each feature become bottlenecks, especially when the feature set becomes
large. We recommend exploiting only those features that are usually available in BD so that a general framework
applicable to most of them can be proposed. To resolve the AND problem in a better way, we suggest a few directions
below that may help improve performance:
1. Semantics play an important role in co-author networks [45]–[47]. WordNet1 captures structured semantics
of words and can be exploited for AND in BD to achieve more accurate results through ontologies [56], [97].
We propose using multi-gram topic models in addition to unigrams of words for the topic distribution over
words. In this way, the natural syntactic relationships among the words are preserved and author writing
habits can become useful for AND. These suggestions can be useful as they consider semantics and can provide
better similarity estimation among the citations.
2. In literature, the transitivity issue is addressed only for the co-authors attribute. We suggest leveraging this
concept for title and venue attributes too.
3. Instead of simply matching the titles of two publications, their reference lists can also be exploited to find
the similarity between the two publications.
4. Most of the methods use only the title of the venue when handling the venue attribute. We suggest considering
the ranking of the publication venues too. Based on this ranking, the REsearch Ability Level (REAL) of a
researcher can be estimated. The REAL value may help predict the correct author, as authors with the same name
might publish in venues of different rank. All these measurements help improve similarity metrics.
5. The change of the research domain of an author is common these days due to overlaps between different
fields. We suggest constructing sub-clusters within a cluster associated with a particular author. Each sub-
cluster can differ from those of others based on multiple topics of interest of the author.
6. The advisor-advisee relationship can also be identified first to develop hierarchies for authors. As a result,
the authors who are not the same will become nodes of distinct branches of a tree.
1 http://wordnet.princeton.edu/
IX. CONCLUSIONS
In this survey, we presented a detailed study of the AND methods for BD. Key challenges are highlighted and a generic
framework is proposed, which is intuitive and broadly applicable. A lot of work has been done on the name variant and
name sharing problems separately, but few efforts have been made to deal with both simultaneously, which needs more
attention. Different types of methods, such as supervised, unsupervised, semi-supervised, graph-based, and
ontology-based, have provided elegant solutions for AND; still, graph-based and ontology-based methods need to be
explored exhaustively. Finally, we have highlighted the major issues and future directions in this field. These future
directions and open challenges can give a quick start to future researchers who are interested in this area.
In this study, we presented a snapshot, as of the time of writing, of the research done on AND in BD, the methods
applied, and future challenges. We believe that the fundamental information, methods, future directions, and open
challenges presented here will help researchers in this area get a quick start, now and in the future.
ACKNOWLEDGEMENT
We are grateful to the Higher Education Commission (HEC) of Pakistan for their financial assistance to promote the
research trend in the country under the Indigenous 5000 Fellowship Program.
REFERENCES
[1] T. Amjad et al., “Standing on the shoulders of giants,” J. Informetr., vol. 11, no. 1, 2017.
[2] J. C. Chen, J. Z. Shyu, and C.-Y. Huang, “Evaluating knowledge diffusion capabilities of higher education institutes by using the DEA,”
Glob. J. Res. Anal., vol. 6, no. 10, 2018.
[3] R. H. Gálvez, “Assessing author self-citation as a mechanism of relevant knowledge diffusion,” Scientometrics, vol. 111, no. 3, 2017.
[4] T. Amjad and A. Ali, “Uncovering diffusion trends in computer science and physics publications,” Libr. Hi Tech, vol. 37, no. 4, 2019.
[5] F. Haneef et al., “Using network science to understand the link between subjects and professions,” Comput. Hum. Behav., vol. 106, p.
106228, 2020.
[6] A. Daud et al., “Finding Rising Stars in Bibliometric Networks,” Scientometrics, pp. 1–20, 2020.
[7] T. Amjad, A. Daud, S. Khan, R. A. Abbasi, and F. Imran, “Prediction of Rising Stars from Pakistani Research Communities,” in 2018 14th
International Conference on Emerging Technologies (ICET), 2018, pp. 1–6.
[8] A. Daud, F. Abbas, T. Amjad, A. A. Alshdadi, and J. S. Alowibdi, “Finding rising stars through hot topics detection,” Future Gener.
Comput. Syst., vol. 115, pp. 798–813, 2021.
[9] T. Amjad, Y. Rehmat, A. Daud, and R. A. Abbasi, “Scientific impact of an author and role of self-citations,” Scientometrics, vol. 122, no.
2, pp. 915–932, 2020.
[10] X. Bai, I. Lee, Z. Ning, A. Tolba, and F. Xia, “The Role of Positive and Negative Citations in Scientific Evaluation,” IEEE Access, vol. 5,
pp. 17607–17617, 2017.
[11] A. Daud, T. Amjad, M. A. Siddiqui, N. R. Aljohani, R. A. Abbasi, and M. A. Aslam, “Correlational analysis of topic specificity and citations
count of publication venues,” Libr. Hi Tech, 2019.
[12] F. González-Sala, J. Osca-Lluch, and J. Haba-Osca, “Are journal and author self-citations a visibility strategy?,” Scientometrics, vol. 119,
no. 3, 2019.
[13] M. K. Hayat et al., “Towards Deep Learning Prospects: Insights for Social Media Analytics,” IEEE Access, vol. 7, pp. 36958–36979, 2019.
[14] D. Bunker, S. Stieglitz, C. Ehnis, and A. Sleigh, “Bright ICT: Social Media Analytics for Society and Crisis Management,” in International
Working Conference on Transfer and Diffusion of IT, 2019, pp. 536–552.
[15] Y.-C. Chang, C.-H. Ku, and C.-H. Chen, “Social media analytics: Extracting and visualizing Hilton hotel ratings and reviews from
TripAdvisor,” Int. J. Inf. Manag., vol. 48, pp. 263–279, 2019.
[16] Z. Saeed et al., “What’s happening around the world? A survey and framework on event detection techniques on twitter,” J. Grid Comput.,
vol. 17, no. 2, pp. 279–312, 2019.
[17] M. S. Faisal, A. Daud, A. U. Akram, R. A. Abbasi, N. R. Aljohani, and I. Mehmood, “Expert ranking techniques for online rated forums,”
Comput. Hum. Behav., vol. 100, pp. 168–176, 2019.
[18] N. Nikzad–Khasmakhi, M. A. Balafar, and M. R. Feizi–Derakhshi, “The state-of-the-art in expert recommendation systems,” Eng. Appl.
Artif. Intell., vol. 82, pp. 126–147, 2019.
[19] T. Amjad, A. Daud, and N. R. Aljohani, “Ranking authors in academic social networks: a survey,” Libr. Hi Tech, vol. 36, no. 1, 2018.
[20] T. Amjad, A. Daud, D. Che, and A. Akram, “MuICE: Mutual Influence and Citation Exclusivity Author Rank,” Inf. Process. Manag., 2015.
[21] T. Amjad, A. Daud, A. Akram, and F. Muhammed, “Impact of mutual influence while ranking authors in a co-authorship network,” Kuwait
J. Sci., vol. 43, no. 3, 2016, Accessed: Oct. 03, 2016. [Online]. Available: http://journals.ku.edu.kw/kjs/index.php/KJS/article/view/941
[22] T. Amjad and A. Daud, “Indexing of authors according to their domain of expertise,” Malays. J. Libr. Inf. Sci., vol. 22, no. 1, 2017.
[23] T. Amjad, Y. Ding, A. Daud, J. Xu, and V. Malic, “Topic-based heterogeneous rank,” Scientometrics, vol. 104, no. 1, 2015.
[24] T. Amjad, “Domain-Specific Scientific Impact and its Prediction,” in 2021 International Conference on Artificial Intelligence (ICAI), 2021,
pp. 16–21.
[25] C. Laorden, I. Santos, B. Sanz, G. Alvarez, and P. G. Bringas, “Word sense disambiguation for spam filtering,” Electron. Commer. Res.
Appl., vol. 11, no. 3, pp. 290–298, 2012.
[26] J. Miró-Borrás and P. Bernabeu-Soler, “Text entry in the e-commerce age: two proposals for the severely handicapped,” J. Theor. Appl.
Electron. Commer. Res., vol. 4, no. 1, pp. 101–112, 2009.
[27] D. Chen, X. Li, Y. Liang, and J. Zhang, “A semantic query approach to personalized e-Catalogs service system,” J. Theor. Appl. Electron.
Commer. Res., vol. 5, no. 3, pp. 39–54, 2010.
[28] Y. Song, J. Huang, I. G. Councill, J. Li, and C. L. Giles, “Efficient topic-based unsupervised name disambiguation,” in Proceedings of the
7th ACM/IEEE-CS joint conference on Digital libraries, 2007, pp. 342–351. Accessed: Oct. 07, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1255243
[29] J. Tang, A. C. Fong, B. Wang, and J. Zhang, “A unified probabilistic framework for name disambiguation in digital library,” IEEE Trans.
Knowl. Data Eng., vol. 24, no. 6, pp. 975–987, 2012.
[30] M. Ley, “The DBLP computer science bibliography: Evolution, research issues, perspectives,” in International Symposium on String
Processing and Information Retrieval, 2002, pp. 1–10. Accessed: Oct. 06, 2016. [Online]. Available:
http://link.springer.com/chapter/10.1007/3-540-45735-6_1
[31] C. L. Giles, K. D. Bollacker, and S. Lawrence, “CiteSeer: An automatic citation indexing system,” in Proceedings of the third ACM
conference on Digital libraries, 1998, pp. 89–98. Accessed: Oct. 06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=276685
[32] N. R. Smalheiser and V. I. Torvik, “Author name disambiguation,” Annu. Rev. Inf. Sci. Technol., vol. 43, no. 1, pp. 1–43, 2009.
[33] D. K. Sanyal, P. K. Bhowmick, and P. P. Das, “A review of author name disambiguation techniques for the PubMed bibliographic database,”
J. Inf. Sci., vol. 47, no. 2, pp. 227–254, 2021.
[34] C. L. Giles, H. Zha, and H. Han, “Name disambiguation in author citations using a k-way spectral clustering method,” in Proceedings of
the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’05), 2005, pp. 334–343. Accessed: Oct. 06, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4118563
[35] B. Malin, “Unsupervised name disambiguation via social network similarity,” in Workshop on link analysis, counterterrorism, and security,
2005, vol. 1401, pp. 93–102. Accessed: Oct. 06, 2016. [Online]. Available: http://www.siam.org/meetings/sdm05/sdm-link-
analysis.zip#page=97
[36] X. Yin, J. Han, and P. S. Yu, “Object distinction: Distinguishing objects with identical names,” in 2007 IEEE 23rd International
Conference on Data Engineering, 2007, pp. 1242–1246. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4221773
[37] D. Lee, B.-W. On, J. Kang, and S. Park, “Effective and scalable solutions for mixed and split citation problems in digital libraries,” in
Proceedings of the 2nd international workshop on Information quality in information systems, 2005, pp. 69–76. Accessed: Apr. 12, 2016.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1077514
[38] Y. F. Tan, M. Y. Kan, and D. Lee, “Search engine driven author disambiguation,” in Proceedings of the 6th ACM/IEEE-CS joint conference
on Digital libraries, 2006, pp. 314–315. Accessed: Apr. 12, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=1141826
[39] I. Bhattacharya and L. Getoor, “A Latent Dirichlet Model for Unsupervised Entity Resolution.,” in SDM, 2006, vol. 5, p. 59. Accessed:
Oct. 06, 2016. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9781611972764.5
[40] L. Shu, B. Long, and W. Meng, “A latent topic model for complete entity resolution,” in Data Engineering, 2009. ICDE’09. IEEE 25th
International Conference on, 2009, pp. 880–891. Accessed: Apr. 12, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4812462
[41] C. Galvez and F. Moya-Anegón, “Approximate personal name-matching through finite-state graphs,” J. Am. Soc. Inf. Sci. Technol., vol.
58, no. 13, pp. 1960–1976, 2007.
[42] P. Treeratpituk and C. L. Giles, “Disambiguating authors in academic publications using random forests,” in Proceedings of the 9th
ACM/IEEE-CS joint conference on Digital libraries, 2009, pp. 39–48. Accessed: Oct. 07, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1555408
[43] X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, “On graph-based name disambiguation,” J. Data Inf. Qual. JDIQ, vol. 2, no. 2, p. 10, 2011.
[44] L. K. Branting, “A comparative evaluation of name-matching algorithms,” in Proceedings of the 9th international conference on Artificial
intelligence and law, 2003, pp. 224–232. Accessed: Apr. 12, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=1047837
[45] A. Daud, “Using time topic modeling for semantics-based dynamic research interest finding,” Knowl.-Based Syst., vol. 26, pp. 154–163,
2012.
[46] A. Daud and F. Muhammad, “Group topic modeling for academic knowledge discovery,” Appl. Intell., vol. 36, no. 4, Art. no. 4, 2012.
[47] A. Daud, J. Li, L. Zhou, and F. Muhammad, “Temporal expert finding through generalized time topic modeling,” Knowl.-Based Syst., vol.
23, no. 6, Art. no. 6, 2010.
[48] A. Daud, J. Li, L. Zhou, and F. Muhammad, “Knowledge discovery through directed probabilistic topic models: a survey,” Front. Comput.
Sci. China, vol. 4, no. 2, pp. 280–301, 2010.
[49] D. A. Dervos, N. Samaras, G. Evangelidis, J. Hyvärinen, and Y. Asmanidis, “The universal author identifier system (UAI_Sys),” 2006,
Accessed: Oct. 07, 2016. [Online]. Available: http://arizona.openrepository.com/arizona/handle/10150/105755
[50] A. M. Ketchum, “ORCID,” 2014, Accessed: Oct. 07, 2016. [Online]. Available:
http://nnlm.gov/sites/default/files/migrated/file/3f8237005268a231622eafdda40c9a49.pdf
[51] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis, “Two supervised learning approaches for name disambiguation in author citations,”
in Digital Libraries, 2004. Proceedings of the 2004 joint ACM/IEEE conference on, 2004, pp. 296–305. Accessed: Oct. 06, 2016. [Online].
Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1336139
[52] F. Wang, J. Tang, J. Li, and K. Wang, “A constraint-based topic modeling approach for name disambiguation,Front. Comput. Sci. China,
vol. 4, no. 1, pp. 100–111, 2010.
[53] B.-W. On, D. Lee, J. Kang, and P. Mitra, “Comparative study of name disambiguation problem using a scalable blocking-based framework,”
in Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, 2005, pp. 344–353. Accessed: Oct. 06, 2016. [Online].
Available: http://dl.acm.org/citation.cfm?id=1065463
[54] D. Zhang, J. Tang, J. Li, and K. Wang, “A constraint-based probabilistic framework for name disambiguation,” in Proceedings of the
sixteenth ACM conference on Conference on information and knowledge management, 2007, pp. 1019–1022. Accessed: Oct. 07, 2016.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1321600
[55] V. I. Torvik, M. Weeber, D. R. Swanson, and N. R. Smalheiser, “A probabilistic similarity metric for Medline records: A model for author
name disambiguation,” J. Am. Soc. Inf. Sci. Technol., vol. 56, no. 2, pp. 140–158, 2005.
[56] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[57] H. Han, W. Xu, H. Zha, and C. L. Giles, “A hierarchical naive Bayes mixture model for name disambiguation in author citations,” in
Proceedings of the 2005 ACM symposium on Applied computing, 2005, pp. 1065–1069. Accessed: Oct. 06, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1066920
[58] A. A. Ferreira, A. Veloso, M. A. Gonçalves, and A. H. Laender, “Effective self-training author name disambiguation in scholarly digital
libraries,” in Proceedings of the 10th annual joint conference on Digital libraries, 2010, pp. 39–48. Accessed: Oct. 07, 2016. [Online].
Available: http://dl.acm.org/citation.cfm?id=1816130
[59] B.-W. On and D. Lee, “Scalable Name Disambiguation using Multi-level Graph Partition.,” in SDM, 2007, pp. 575–580. Accessed: Oct.
07, 2016. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9781611972771.64
Author Name Disambiguation in Bibliographic Databases: A Survey
Volume 2, Issue 1, Article 9, Pages 87-110, Dec, 2021 108
[60] K.-H. Yang, H.-T. Peng, J.-Y. Jiang, H.-M. Lee, and J.-M. Ho, “Author name disambiguation for citations using topic and web correlation,”
in International Conference on Theory and Practice of Digital Libraries, 2008, pp. 185–196. Accessed: Oct. 07, 2016. [Online]. Available:
http://link.springer.com/chapter/10.1007/978-3-540-87599-4_19
[61] P. Reuther, “Personal name matching: New test collections and a social network based approach,” Comput. Sci. Tech. Rep., pp. 06–01,
2006.
[62] S. Pandit, S. Gupta, and others, “A comparative study on distance measuring approaches for clustering,” Int. J. Res. Comput. Sci., vol. 2,
no. 1, pp. 29–31, 2011.
[63] W. Cohen, P. Ravikumar, and S. Fienberg, “A comparison of string metrics for matching names and records,” in Kdd workshop on data
cleaning and object consolidation, 2003, vol. 3, pp. 73–78. Accessed: Oct. 07, 2016. [Online]. Available:
https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/kdd-2003-match-ws.pdf
[64] A. E. Monge, C. Elkan, and others, “The Field Matching Problem: Algorithms and Applications.,” in KDD, 1996, pp. 267–270. Accessed:
Oct. 07, 2016. [Online]. Available: http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf
[65] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids.
Cambridge university press, 1998.
[66] M. A. Jaro, “Probabilistic linkage of large public health data files,” Stat. Med., vol. 14, no. 5–7, pp. 491–498, 1995.
[67] W. E. Winkler, “The state of record linkage and current research problems,” 1999. Accessed: Oct. 07, 2016. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.4336
[68] Y. Chen and J. Martin, “Towards Robust Unsupervised Personal Name Disambiguation.,” in EMNLP-CoNLL, 2007, pp. 190–198.
Accessed: Oct. 07, 2016. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.7132&rep=rep1&type=pdf#page=224
[69] L. Jin, C. Li, and S. Mehrotra, “Efficient record linkage in large data sets,” in Database Systems for Advanced Applications, 2003.(DASFAA
2003). Proceedings. Eighth International Conference on, 2003, pp. 137–146. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1192377
[70] R. Bekkerman and A. McCallum, “Disambiguating web appearances of people in a social network,” in Proceedings of the 14th international
conference on World Wide Web, 2005, pp. 463–470. Accessed: Oct. 06, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1060813
[71] B.-W. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei, “Improving grouped-entity resolution using quasi-cliques,” in Sixth International
Conference on Data Mining (ICDM’06), 2006, pp. 1008–1015. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4053144
[72] G. Salton, A. Wong, and C.-S. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.
[73] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence,
1999, pp. 289–296. Accessed: Oct. 07, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=2073829
[74] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Natl. Acad. Sci., vol. 101, no. suppl 1, pp. 5228–5235, 2004.
[75] M. A. Hernández and S. J. Stolfo, “The merge/purge problem for large databases,” in ACM Sigmod Record, 1995, vol. 24, pp. 127–138.
Accessed: Oct. 06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=223807
[76] H. L. Dunn, “Record linkage*,” Am. J. Public Health Nations Health, vol. 36, no. 12, pp. 1412–1416, 1946.
[77] D. Bitton and D. J. DeWitt, “Duplicate record elimination in large data files,” ACM Trans. Database Syst. TODS, vol. 8, no. 2, pp. 255–
265, 1983.
[78] K. J. Cios, R. W. Swiniarski, W. Pedrycz, and L. A. Kurgan, “The knowledge discovery process,” in Data Mining, 2007, pp. 9–24. Accessed:
Oct. 06, 2016. [Online]. Available: http://link.springer.com/content/pdf/10.1007/978-0-387-36795-8_2.pdf
[79] W. W. Cohen, H. Kautz, and D. McAllester, “Hardening soft information sources,” in Proceedings of the sixth ACM SIGKDD international
conference on Knowledge discovery and data mining, 2000, pp. 255–259. Accessed: Oct. 06, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=347141
[80] A. Bagga, Coreference, cross-document coreference, and information extraction methodologies. Duke University, 1998. Accessed: Oct.
06, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=927251
[81] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser, “Identity uncertainty and citation matching,” in Advances in neural information
processing systems, 2002, pp. 1401–1408. Accessed: Oct. 06, 2016. [Online]. Available:
http://machinelearning.wustl.edu/mlpapers/paper_files/AP01.pdf
[82] Y.-T. Park and J.-M. Kim, “OnCU system: ontology-based category utility approach for author name disambiguation,” in Proceedings of
the 2nd international conference on Ubiquitous information management and communication, 2008, pp. 63–68. Accessed: Oct. 07, 2016.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1352807
[83] C. L. Scoville, E. D. Johnson, and A. L. McConnell, “When A. Rose is not A. Rose: the vagaries of author searching,” Med. Ref. Serv. Q.,
vol. 22, no. 4, pp. 1–11, 2003.
[84] A. Culotta, P. Kanani, R. Hall, M. Wick, and A. McCallum, “Author disambiguation using error-driven machine learning with a ranking
loss function,” 2007. Accessed: Oct. 07, 2016. [Online]. Available: http://www.aaai.org/Papers/Workshops/2007/WS-07-14/WS07-14-
006.pdf
[85] V. I. Torvik and N. R. Smalheiser, “Author name disambiguation in MEDLINE,” ACM Trans. Knowl. Discov. Data TKDD, vol. 3, no. 3,
p. 11, 2009.
[86] Y. Qian, Y. Hu, J. Cui, Q. Zheng, and Z. Nie, “Combining machine learning and human judgment in author disambiguation,” in Proceedings
of the 20th ACM international conference on Information and knowledge management, 2011, pp. 1241–1246. Accessed: Oct. 07, 2016.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2063756
[87] X. Wang, J. Tang, H. Cheng, and S. Y. Philip, “Adana: Active name disambiguation,” in 2011 IEEE 11th International Conference on
Data Mining, 2011, pp. 794–803. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6137284
[88] V. Vapnik, The nature of statistical learning theory. Springer Science & Business Media, 2013. Accessed: Oct. 06, 2016. [Online].
Available:
https://books.google.com.pk/books?hl=en&lr=&id=EqgACAAAQBAJ&oi=fnd&pg=PR7&dq=The+Nature+of+Statistical+Learning+Th
eory&ots=g2K2mycZ25&sig=ypjXo0ldi0UPQbOmCE-sfUJn4fk
[89] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge
university press, 2000. Accessed: Oct. 06, 2016. [Online]. Available:
Author Name Disambiguation in Bibliographic Databases: A Survey
Volume 2, Issue 1, Article 9, Pages 87-110, Dec, 2021 109
https://books.google.com.pk/books?hl=en&lr=&id=_PXJn_cxv0AC&oi=fnd&pg=PR9&dq=An+Introduction+to+Support+Vector+Mach
ines&ots=xRTi9F3u29&sig=qAt8_84DBLerD35-wG1RmkhA_18
[90] H. Han, H. Zha, and C. L. Giles, “A model-based k-means algorithm for name disambiguation,” 2003. Accessed: Oct. 07, 2016. [Online].
Available: http://ceur-ws.org/Vol-83/int_2.pdf
[91] X. Sun, J. Kaur, L. Possamai, and F. Menczer, “Detecting ambiguous author names in crowdsourced scholarly data,” in Privacy, Security,
Risk and Trust (PASSAT) and 2011 IEEE Third Inernational Conference on Social Computing (SocialCom), 2011 IEEE Third International
Conference on, 2011, pp. 568–571. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6113170
[92] D. T. Hoang, J. Kaur, and F. Menczer, “Crowdsourcing scholarly data,” 2010, Accessed: Oct. 07, 2016. [Online]. Available:
http://journal.webscience.org/321/2/websci10_submission_107.pdf
[93] B. Zhang, M. Dundar, and M. A. Hasan, “Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using
Temporal Record Streams,” ArXiv Prepr. ArXiv160705746, 2016, Accessed: Oct. 07, 2016. [Online]. Available:
http://arxiv.org/abs/1607.05746
[94] E. Elmacioglu, J. Kang, D. Lee, J. Pei, and B. On, “An effective approach to entity resolution problem using quasi-clique and its application
to digital libraries,” in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’06), 2006, pp. 51–52. Accessed:
Oct. 07, 2016. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4119096
[95] I. Bhattacharya and L. Getoor, “Collective entity resolution in relational data,” ACM Trans. Knowl. Discov. Data TKDD, vol. 1, no. 1, p.
5, 2007.
[96] R. G. Cota, M. A. Gonçalves, and A. H. Laender, “A Heuristic-based Hierarchical Clustering Method for Author Name Disambiguation in
Digital Libraries.,” in SBBD, 2007, pp. 20–34. Accessed: Oct. 07, 2016. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5709&rep=rep1&type=pdf
[97] J. Soler, “Separating the articles of authors with the same name,” Scientometrics, vol. 72, no. 2, pp. 281–290, 2007.
[98] I.-S. Kang et al., “On co-authorship for author disambiguation,” Inf. Process. Manag., vol. 45, no. 1, pp. 84–97, 2009.
[99] D. A. Pereira, B. Ribeiro-Neto, N. Ziviani, A. H. Laender, M. A. Gonçalves, and A. A. Ferreira, “Using web information for author name
disambiguation,” in Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, 2009, pp. 49–58. Accessed: Oct. 07, 2016.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1555409
[100] A. E. Gelfand, “Gibbs sampling,” J. Am. Stat. Assoc., vol. 95, no. 452, pp. 1300–1304, 2000.
[101] D. Shin, T. Kim, H. Jung, and J. Choi, “Automatic method for author name disambiguation using social networks,” in 2010 24th IEEE
International Conference on Advanced Information Networking and Applications, 2010, pp. 1263–1270. Accessed: Oct. 07, 2016. [Online].
Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5474861
[102] D. Shin, J. Kang, J. Choi, and J. Yang, “Detecting collaborative fields using social networks,” in Networked Computing and Advanced
Information Management, 2008. NCM’08. Fourth International Conference on, 2008, vol. 1, pp. 325–328. Accessed: Oct. 07, 2016.
[Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4624027
[103] K.-H. Yang and Y.-H. Wu, “Author name disambiguation in citations,” in Proceedings of the 2011 IEEE/WIC/ACM International
Conferences on Web Intelligence and Intelligent Agent Technology-Volume 03, 2011, pp. 335–338. Accessed: Oct. 07, 2016. [Online].
Available: http://dl.acm.org/citation.cfm?id=2052298
[104] H. Künsch, S. Geman, and A. Kehagias, “Hidden Markov random fields,” Ann. Appl. Probab., pp. 577–602, 1995.
[105] H. Wu, B. Li, Y. Pei, and J. He, “Unsupervised author disambiguation using Dempster–Shafer theory,” Scientometrics, vol. 101, no. 3, pp.
1955–1972, 2014.
[106] Y. Qian, Q. Zheng, T. Sakai, J. Ye, and J. Liu, “Dynamic author name disambiguation for growing digital libraries,” Inf. Retr. J., vol. 18,
no. 5, pp. 379–412, 2015.
[107] M. Khabsa, P. Treeratpituk, and C. L. Giles, “Online person name disambiguation with constraints,” in Proceedings of the 15th ACM/IEEE-
CS Joint Conference on Digital Libraries, 2015, pp. 37–46. Accessed: Oct. 08, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2756915
[108] C.-C. Sun, D.-R. Shen, Y. Kou, T.-Z. Nie, and G. Yu, “Topological Features Based Entity Disambiguation,” J. Comput. Sci. Technol., vol.
31, no. 5, pp. 1053–1068, 2016.
[109] J. Huang, S. Ertekin, and C. L. Giles, “Efficient name disambiguation for large-scale databases,” in European Conference on Principles of
Data Mining and Knowledge Discovery, 2006, pp. 536–544. Accessed: Oct. 07, 2016. [Online]. Available:
http://link.springer.com/10.1007%2F11871637_53
[110] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg, “Adaptive name matching in information integration,” IEEE Intell.
Syst., vol. 18, no. 5, pp. 16–23, 2003.
[111] F. Wang, J. Li, J. Tang, J. Zhang, and K. Wang, “Name disambiguation using atomic clusters,” in Web-Age Information Management,
2008. WAIM’08. The Ninth International Conference on, 2008, pp. 357–364. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4597035
[112] T. Arif, R. Ali, and M. Asger, “Author name disambiguation using vector space model and hybrid similarity measures,” in Contemporary
Computing (IC3), 2014 Seventh International Conference on, 2014, pp. 135–140. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6897162
[113] D. V. Kalashnikov and S. Mehrotra, “Domain-independent data cleaning via analysis of entity-relationship graph,” ACM Trans. Database
Syst. TODS, vol. 31, no. 2, pp. 716–767, 2006.
[114] Z. Chen, D. V. Kalashnikov, and S. Mehrotra, “Adaptive graphical approach to entity resolution,” in Proceedings of the 7th ACM/IEEE-
CS joint conference on Digital libraries, 2007, pp. 204–213. Accessed: Oct. 07, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1255215
[115] H. Jin, L. Huang, and P. Yuan, “Name disambiguation using semantic association clustering,” in e-Business Engineering, 2009. ICEBE’09.
IEEE International Conference on, 2009, pp. 42–48. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5342132
[116] J. Pei, D. Jiang, and A. Zhang, “On mining cross-graph quasi-cliques,” in Proceedings of the eleventh ACM SIGKDD international
conference on Knowledge discovery in data mining, 2005, pp. 228–238. Accessed: Oct. 07, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1081898
Author Name Disambiguation in Bibliographic Databases: A Survey
Volume 2, Issue 1, Article 9, Pages 87-110, Dec, 2021 110
[117] D. M. McRae-Spencer and N. R. Shadbolt, “Also by the same author: AKTiveAuthor, a citation graph approach to name disambiguation,”
in Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 2006, pp. 53–54. Accessed: Oct. 07, 2016. [Online].
Available: http://dl.acm.org/citation.cfm?id=1141762
[118] D. Shin, T. Kim, J. Choi, and J. Kim, “Author name disambiguation using a graph model with node splitting and merging based on
bibliographic information,” Scientometrics, vol. 100, no. 1, pp. 15–50, 2014.
[119] J. Kleb and R. Volz, “Ontology based entity disambiguation with natural language patterns,” in Digital Information Management, 2009.
ICDIM 2009. Fourth International Conference on, 2009, pp. 1–8. Accessed: Oct. 07, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5356769
[120] M. Yankova, H. Saggion, and H. Cunningham, “Adopting ontologies for multisource identity resolution,” in Proceedings of the first
international workshop on Ontology-supported business intelligence, 2008, p. 6. Accessed: Oct. 07, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1452573
[121] H. T. Nguyen and T. H. Cao, “Named entity disambiguation on an ontology enriched by Wikipedia,” in Research, Innovation and Vision
for the Future, 2008. RIVF 2008. IEEE International Conference on, 2008, pp. 247–254. Accessed: Oct. 09, 2016. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4586363
[122] H. T. Nguyen and T. H. Cao, “Enriching ontologies for named entity disambiguation,” 2010. Accessed: Oct. 07, 2016. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.2198&rep=rep1&type=pdf
[123] J. Hassell, B. Aleman-Meza, and I. B. Arpinar, “Ontology-driven automatic entity disambiguation in unstructured text,” in International
Semantic Web Conference, 2006, pp. 44–57. Accessed: Oct. 07, 2016. [Online]. Available:
http://link.springer.com/10.1007%2F11926078_4