Privacy-Preserving Record Linkage
for Big Data: Current Approaches
and Research Challenges
Dinusha Vatsalan, Ziad Sehili, Peter Christen and Erhard Rahm
Abstract The growth of Big Data, especially personal data dispersed in multiple data
sources, presents enormous opportunities and insights for businesses to explore and
leverage the value of linked and integrated data. However, privacy concerns impede
sharing or exchanging data for linkage across different organizations. Privacy-
preserving record linkage (PPRL) aims to address this problem by identifying and
linking records that correspond to the same real-world entity across several data
sources held by different parties without revealing any sensitive information about
these entities. PPRL is increasingly being required in many real-world application
areas. Examples range from public health surveillance to crime and fraud detection,
and national security. PPRL for Big Data poses several challenges, with the three
major ones being (1) scalability to multiple large databases, due to their massive
volume and the flow of data within Big Data applications, (2) achieving high quality
results of the linkage in the presence of variety and veracity of Big Data, and (3)
preserving privacy and confidentiality of the entities represented in Big Data collec-
tions. In this chapter, we describe the challenges of PPRL in the context of Big Data,
survey existing techniques for PPRL, and provide directions for future research.
Keywords Record linkage · Privacy · Big data · Scalability
D. Vatsalan · P. Christen
Research School of Computer Science, The Australian National University,
Acton, ACT 2601, Australia
e-mail: dinusha.vatsalan@anu.edu.au
P. Christen
e-mail: peter.christen@anu.edu.au
Z. Sehili · E. Rahm (✉)
Database Group, University of Leipzig, 04109 Leipzig, Germany
e-mail: rahm@informatik.uni-leipzig.de
Z. Sehili
e-mail: sehili@informatik.uni-leipzig.de
© Springer International Publishing AG 2017
A.Y. Zomaya and S. Sakr (eds.), Handbook of Big Data Technologies,
DOI 10.1007/978-3-319-49340-4_25
1 Introduction
With the Big Data revolution, many organizations collect and process datasets that
contain many millions of records to analyze and mine interesting patterns and knowl-
edge in order to empower efficient and quality decision making [28,53]. Analyzing
and mining such large datasets often require data from multiple sources to be linked
and aggregated. Linking records from different data sources with the aim to improve
data quality or enrich data for further analysis is occurring in an increasing number of
application areas, such as healthcare, government services, crime and fraud detection,
national security, and business applications [28,52]. Effective ways of linking data
from different sources have also played an increasingly important role in generating
new insights for population informatics in the health and social sciences [99].
For example, linking health databases from different organizations facilitates qual-
ity health data mining and analytics in applications such as epidemiological studies
(outbreak detection of infectious diseases) or adverse drug reaction studies [20,116].
These applications require data from several organizations to be linked, for example
human health data, travel data, consumed drug data, and even animal health data [38].
Linked health databases can also be used for the development of health policies in
a more efficient and effective way compared to traditionally used time-consuming
survey studies [37,88].
Record linkage techniques are also being used by national security agencies and
crime investigators for effective identification of fraud, crime, or terrorism sus-
pects [73,125,168]. Such applications require data from law enforcement agencies,
immigration departments, Internet service providers, businesses, as well as financial
institutions [125].
In recent times, record linkage has increasingly been required by social scien-
tists in the field of population informatics to study insights into our society from
‘social genome’ data, the digital traces that contain person-level data about social
beings [99]. The ‘Beyond 2011’ program by the Office for National Statistics in the
UK, for example, has carried out research to study different possible approaches to
producing population and socio-demographics statistics for England and Wales by
linking data from several sources [121].
Record linkage within a single organization does not generally involve privacy
and confidentiality concerns (assuming there are no internal threats within the orga-
nization and the linked data are not being revealed outside the organization). An
example application is the deduplication of a customer database by a business
using record linkage techniques for conducting effective marketing activities. How-
ever, in many countries record linkage across several organizations, as required in
the above example applications, might not allow the exchange or the sharing of
database records between organizations due to laws or regulations. Some example
Acts that describe the legal restrictions of disclosing personal or sensitive data are:
(1) the Data-Matching Program Act in Australia,¹ (2) the European Union (EU) Per-
¹ https://www.oaic.gov.au/privacy-law/other-legislation/government-data-matching [Accessed:
15/06/2016].
sonal Data Protection Act in Europe,² and (3) the Health Insurance Portability and
Accountability Act (HIPAA) in the USA.³
The privacy requirements in the record linkage process have been addressed by
developing ‘privacy-preserving record linkage’ (PPRL) techniques, which aim to
identify matching records that refer to the same entities in different databases without
compromising privacy and confidentiality of these entities. In a PPRL project, the
database owners (or data custodians) agree to reveal only selected information about
records that have been classified as matches among each other, or to an external
party, such as a researcher [164]. However, record linkage requires access to the
actual values of certain attributes.
Known as quasi-identifiers (QIDs), these attributes need to be common in all
databases to be linked and represent identifying characteristics of entities to allow
matching of records. Examples of QIDs are first and last names, addresses, tele-
phone numbers, or dates of birth. Such QIDs often contain private and confidential
information of entities that cannot be revealed, and therefore the linkage has to be
conducted on masked (encoded) versions of the QIDs to preserve the privacy of
entities. Several masking techniques have been developed (as we will describe in
Sect. 3.4), using two different types of general approaches: (1) secure multi-party
computation (SMC) [111] and (2) data perturbation [87].
Leveraging the tremendous opportunities that Big Data can provide for businesses
comes with the challenges that PPRL poses, including scalability, quality, and pri-
vacy. Big Data implies enormous data volume as well as massive flows (velocity)
of data, leading to scalability challenges even with advanced computing technology.
The variety and veracity aspects of Big Data require biases, noise, variations and
abnormalities in data to be considered, which makes the linkage process more chal-
lenging. With Big Data containing massive amounts of personal data, linking and
mining data may breach the privacy of those represented by the data. A practical
PPRL solution that can be used in real-world applications should therefore address
these challenges of scalability, linkage quality, and privacy. A variety of PPRL tech-
niques has been developed over the past two decades, as surveyed in [154,164].
However, these existing approaches for PPRL fall short in providing a sound solu-
tion in the Big Data era by not addressing all of the Big Data challenges. Therefore,
more research is required to leverage the huge potential that linking databases in
the era of Big Data can provide for businesses, government agencies, and research
organizations.
In this chapter, we review the existing challenges and techniques, and discuss
research directions of PPRL for Big Data. We provide the preliminaries in Sect. 2
and review existing privacy techniques for PPRL in Sect. 3. We then discuss the
scalability challenge and existing approaches that address scalability of PPRL in
Sect. 4. In Sect. 5, we describe the challenges and existing techniques of PPRL on
multiple databases, which is an emerging research avenue that is being increasingly
required in many Big Data applications. In Sect. 6 we discuss research directions in
² http://ec.europa.eu/justice/data-protection/index_en.htm [Accessed: 15/06/2016].
³ http://www.hhs.gov/ocr/privacy/ [Accessed: 15/06/2016].
PPRL for Big Data, and in Sect. 7 we conclude this chapter with a brief summary of
the topic covered.
2 Background
Building on the introduction to record linkage and privacy-preserving record link-
age (PPRL) in Sect. 1, we now present background material that contributes to the
understanding of the preliminaries. We describe the basic concepts and challenges
in Sect. 2.1, and then describe the process of PPRL in Sect. 2.2.
2.1 Overview and Challenges of PPRL
Record linkage is a widely used data pre-processing and data cleaning task where the
aim is to link and integrate records that refer to the same entity from two or multiple
disparate databases. The record pairs (when linking two databases) or record sets
(when linking more than two databases) are compared and classified as ‘matches’ by
a linkage model if they are assumed to refer to the same entity, or as ‘non-matches’
if they are assumed to refer to different entities [26,54]. The frequent absence of
unique entity identifiers across the databases to be linked makes it impossible to use
a simple SQL join [30], and therefore linkage requires sophisticated comparisons
between a set of QIDs (such as names and addresses) that are commonly available
in the records to be linked. However, these QIDs often contain personal information
and therefore revealing or exchanging them for linkage is not possible due to privacy
and confidentiality concerns.
As an example scenario, assume a demographer who aims to investigate how
mortgage stress (having to pay large sums of money on a regular basis to pay off
a house) is affecting people with regard to their mental and physical health. This
research will require data from financial institutions as well as hospitals as shown
in Tables 1 and 2. Neither of these organizations is likely to be willing, or allowed by law,
to provide their databases to the researcher. The researcher only requires access to
some attributes of the records (such as loan type, balance amount, blood pressure,
and stress level) that are linked across these databases, but not the actual identities
of the individuals that were linked. However, personal details (such as name, age or
date of birth, gender, and address) are needed as QIDs to conduct the linkage due to
the absence of unique identifiers across the databases.
As illustrated in the above example application (shown in Tables 1 and 2), link-
ing records in a privacy-preserving context is important, as sharing or exchanging
sensitive and confidential personal data (contained in QIDs of records) between
organizations is often not feasible due to privacy concerns, legal restrictions, or
commercial interests. Therefore, databases need to be linked in such ways that no
sensitive information is being revealed to any of the organizations involved in a cross-
Table 1 Example bank database

ID    Given_name  Surname  DOB       Gender  Address                Loan_type  Balance
6723  Peter       Robert   20.06.72  M       16 Main Street 2617    Mortgage   230,000
8345  Smith       Roberts  11.10.79  M       645 Reader Ave 2602    Personal   8,100
9241  Amelia      Millar   06.01.74  F       49 Applecross Rd 2415  Mortgage   320,750

Table 2 Example health database

PID    Last_name  First_name  Age  Address                  Sex  Pressure  Stress  Reason
P1209  Roberts    Peter       41   16 Main St 2617          m    140/90    High    Chest pain
P4204  Miller     Amelia      39   49 Applecross Road 2415  f    120/80    High    Headache
P4894  Sieman     Jeff        30   123 Norcross Blvd 2602   m    110/80    Normal  Checkup
organizational linkage project, and no adversary is able to learn anything about these
sensitive data. This problem has been addressed by the emerging research area of
PPRL [164].
The basic ideas of PPRL techniques are to mask (encode) the databases at their
sources and to conduct the linkage using only these masked data. This means no
sensitive data are ever exchanged between the organizations involved in a PPRL
protocol, or revealed to any other party. At the end of such a PPRL process, the
database owners only learn which of their own records match with a high similar-
ity with records from the other database(s). The next steps would be exchanging
the values in certain attributes of the matched records (such as loan type, balance
amount, blood pressure, and stress level in the above example) between the database
owners, or sending selected attribute values to a third party, such as a researcher
who requires the linked data for their project [164]. Recent research outcomes and
experiments conducted in real health data linkage validate that PPRL can achieve
linkage quality with only a small loss compared to traditional record linkage using
unencoded QIDs [134,135].
Using PPRL for Big Data involves many challenges, among which the follow-
ing three key challenges need to be addressed to make PPRL viable for Big Data
applications:
1. Scalability: The number of comparisons required for classifying record pairs or
sets equals the product of the sizes of the databases that are linked. This is a
performance bottleneck in the record linkage process since it potentially requires
comparison of all record pairs/sets using expensive comparison functions [9,
31]. Due to the increasing size of Big Data (volume), comparing all records is
not feasible in most real-world applications. Blocking and filtering techniques
have been used to overcome this challenge by eliminating as many comparisons
between non-matching records as possible [9,29,150].
2. Linkage quality: The emergence of Big Data brings with it the challenge of deal-
ing with typographical errors and other variations in data (variety and veracity)
making the linkage more challenging. The exact matching of QID values, which
would classify pairs or sets of records as matches if their QIDs are exactly the
same and as non-matches otherwise, will likely lead to low linkage accuracy in
the presence of real-world data errors. In addition, the classification models used
in record linkage should be effective and accurate in classifying matches and non-
matches [31]. Therefore, for practical record linkage applications, techniques are
required that facilitate both approximate matching of QID values for comparison,
as well as effective classification of record pairs/sets for high linkage accuracy.
3. Privacy: The privacy-preserving requirement in the record linkage process adds
a third challenge, privacy, to the two main challenges of scalability and linkage
quality [164]. Linking Big Data containing massive amounts of personal data
generally involves privacy and confidentiality issues. Privacy needs to be consid-
ered in all steps of the record linkage process as only the masked (or encoded)
records can be used, making the task of linking databases across organizations
more challenging. Several masking techniques have been used for PPRL, as we
will discuss in detail in Sect. 3.4.
2.2 The PPRL Process and Techniques Used
In this section we discuss the steps and the techniques used in the PPRL process, as
shown in Fig. 1.
Data Pre-processing and Masking: The first important step for quality linkage
outcomes is data pre-processing. Real-world data are often noisy, incomplete and
inconsistent [8,128], and they need to be cleaned in this step by filling in missing
data, removing unwanted values, transforming data into well-defined and consistent
forms, and resolving inconsistencies in data representations and encodings [28,36].
In PPRL, data masking (encoding) is an additional step. Data pre-processing and
masking can be conducted independently at each data source. However, it is important
that all database owners (or parties) who participate in a PPRL project conduct the
same data pre-processing and masking steps on the data they will use for linking.
Some exchange of information between the parties about what data pre-processing
Fig. 1 Outline of the general PPRL process as discussed in Sect. 2.2. The steps shown in dark
outlined boxes need to be conducted on masked database records, while dotted arrows show
alternative data flows between steps
and masking approaches they use, as well as which attributes are to be used as QIDs, is
therefore required [164].
Blocking/filtering: Blocking/filtering is the second step of the process, which is
aimed at reducing the number of comparisons that need to be conducted between
records by pruning as many pairs or sets of records as possible that are unlikely to
correspond to matches [9,29]. Blocking groups records according to a blocking criterion
(blocking key) such that comparisons are limited to records in the same (or similar)
blocks, while filtering prunes potential non-matches based on their properties (e.g.
length differences of QIDs) [29]. The output of this step is a set of candidate record pairs
(or sets) that contain records that are potentially matching, which need to be com-
pared in more detail. Blocking/filtering can either be conducted on masked records
or locally by the database owners on unmasked records. The scalability challenge of
PPRL has been addressed by several recent approaches using private blocking and
filtering techniques [46,78,131,133,149,150,159,163], as will be described in
Sects. 4 and 5.1.
Comparison: Candidate record pairs (or sets) are compared in detail in the com-
parison step using comparison (or similarity) functions [32]. Various comparison
functions have been used in record linkage including Levenshtein edit distance,
Jaro-Winkler comparison, Soft-TFIDF string comparison, and token-based compar-
ison using the Overlap, Dice, or Jaccard coefficient [28]. These comparison functions
provide a numerical value representing the similarity of the compared QID values,
often normalized into the [0, 1] interval where a similarity of 1 corresponds to two
values being exactly the same, and 0 means two values being completely different.
Several QIDs are normally used when comparing records, resulting in one weight
vector for each compared record pair that contains the numerical similarity values
of all compared QIDs.
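As an illustration of the comparison step, the following minimal Python sketch computes q-gram based Dice-coefficient similarities of the QID values of two records and collects them into a weight vector. The attribute names and values are illustrative only and not taken from any specific PPRL system.

```python
def qgrams(value, q=2):
    """Return the set of character q-grams (sub-strings of length q)."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def dice_sim(val1, val2, q=2):
    """Dice-coefficient similarity of two strings, normalized into [0, 1]."""
    s1, s2 = qgrams(val1, q), qgrams(val2, q)
    if not s1 or not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

# One weight vector per compared record pair, one similarity per QID
record_a = {'first_name': 'peter', 'surname': 'robert'}
record_b = {'first_name': 'peter', 'surname': 'roberts'}
weight_vector = [dice_sim(record_a[qid], record_b[qid])
                 for qid in ('first_name', 'surname')]
print(weight_vector)  # [1.0, 0.909...]
```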
The QIDs of records often contain variations and errors, and therefore simply
masking these values with a secure one-way hash-encoding function (as will be
described in Sect. 3.4) and comparing the masked values will not result in high linkage
quality for PPRL [35,122]. A small variation in a pair of QID values will lead to
completely different hash-encoded values [35], which enables only exactly matching
QID values to be identified with such a simple masking approach. Therefore, an
effective masking approach for securely and accurately calculating the approximate
similarity of QID values is required. Several approximate comparison functions have
been adapted into a PPRL context, including the Levenshtein edit distance [76] and
the Overlap, Dice, and Jaccard coefficients [164].
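The following small Python sketch illustrates this limitation; SHA-256 is used here only as an illustrative one-way hash function. A single character difference between two QID values yields two completely different hash-codes, so only exactly matching values can be identified on the masked data:

```python
import hashlib

# 'smith' and 'smyth' differ in one character, but their masked values
# share no exploitable similarity, so approximate matching is impossible
for name in ('smith', 'smyth'):
    print(name, hashlib.sha256(name.encode()).hexdigest()[:16])
```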
Classification: In the classification step, the weight vectors of the compared candi-
date record pairs (or sets) are given as input to a decision model which will classify
them into matches, non-matches, and possible matches [31,54,63], where the latter
class is for cases where the classification model cannot make a clear decision. A
widely used classification approach for record linkage is the probabilistic method
developed by Fellegi and Sunter in the 1960s [54]. In this model, the likelihood that
a pair (or set) of records corresponds to a match or a non-match is modelled based
on a-priori error estimates on the data, frequency distributions of QID values, as well
as their similarity calculated in the comparison step [28]. Other classification tech-
niques include simple threshold-based and rule-based classifiers [28]. Most PPRL
techniques developed so far employ a simple threshold-based classification [164].
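A minimal Python sketch of such a simple threshold-based classifier is given below; the use of the average similarity of a weight vector and the two threshold values are illustrative assumptions rather than settings from a particular PPRL system:

```python
def classify(weight_vector, t_upper=0.85, t_lower=0.65):
    """Classify a compared record pair based on its weight vector."""
    avg_sim = sum(weight_vector) / len(weight_vector)
    if avg_sim >= t_upper:
        return 'match'
    if avg_sim <= t_lower:
        return 'non-match'
    return 'possible match'  # left for clerical review

print(classify([1.0, 0.91]))   # match
print(classify([0.7, 0.75]))   # possible match
```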
Supervised machine learning approaches, such as support vector machines and
decision trees [14,25,51,52], can be used for more effective and accurate classifica-
tion results. These require training data with known class labels for matches and non-
matches to train the decision model. Once trained, the model can be used to classify
the remaining unlabelled pairs/sets of records. Such training data, however, are often
not available in real record linkage applications, especially in privacy-preserving set-
tings [28]. Alternatively, semi-supervised techniques (such as active learning-based
techniques [3,11,169]), that actively use examples manually labeled by experts
to train and improve the decision model, need to be developed for PPRL. Recently
developed advanced classification models, such as (a) collective linkage [13,74] that
considers relational similarities with other records in addition to QID similarities, (b)
group linkage [123] that calculates similarities between groups of records based on
pair-wise similarities, and (c) graph-based linkage [58,66,74] that considers the structure
between groups of records, can achieve high linkage quality at the cost of higher
computational complexity. However, these advanced classification techniques have
not been explored for PPRL so far.
Clerical review: The record pairs/sets that are classified as possible matches require
a clerical review process, where these pairs are manually assessed and classified into
matches or non-matches [171]. This is usually a time-consuming and error-prone
process which depends upon the experience of the experts who conduct the review. Active
learning-based approaches can be used for clerical review [3,11,169]. However,
clerical review in its current form is not possible in a PPRL scenario since the actual
QID values of records cannot be inspected because this would reveal sensitive private
information. Recent work in PPRL suggests an interactive approach with human-
machine interaction to improve the quality of linkage results without compromising
privacy [100].
Evaluation: The final step in the process is the evaluation of the complexity, quality,
and privacy of the linkage to measure the applicability of a linkage project in an
application before implementing it into an operational system. A variety of evaluation
measures have been proposed [29,31]. Given that in a practical record linkage application
the true match status of the compared record pairs is unlikely to be known, measuring
linkage quality is difficult [6,31]. How to evaluate the amount of privacy protection
using a set of standard measures is still an immature aspect in the PPRL literature.
Vatsalan et al. recently proposed a set of evaluation measures for privacy using
probability of suspicion [165]. Entropy and information gain, between unmasked
and masked QID values, have also been used as privacy measures [46].
Tools: Different record linkage approaches have been implemented within a number
of tools. Koepcke and Rahm provided a detailed overview of eleven such tools
in [95], covering both categories of tools with and without the use of learning-based
(supervised) classification. The comparative evaluation study [96] benchmarks selected
tools from both categories on four real-life test cases. It found that learning-based
approaches generally achieve better linkage quality, especially for complex tasks
requiring the combination of several attribute similarities. Current tools for link
discovery, i.e., matching entities between sources on the linked open data web, are
surveyed in [119]. A web-based tool was recently developed to demonstrate several
multi-party PPRL approaches (as will be described in Sect. 5) [132].
3 Privacy Aspects and Techniques for PPRL
Several dimensions of privacy need to be considered for PPRL, the four main ones
being: (1) the number of parties and their roles, (2) adversary models, (3) privacy
attacks, and (4) data masking or encoding techniques. In Sects. 3.1–3.4, we describe
these four privacy dimensions, and in Sect. 3.5 we provide an overview of Bloom
filter-based data masking, a technique widely used in PPRL.
3.1 PPRL Scenarios
PPRL techniques for linking two databases can be classified into those that require a
linkage unit for performing the linkage and those that do not. The former are known
as ‘three-party protocols’ and the latter as ‘two-party protocols’ [24,27,167]. In
three-party protocols, a (trusted) third party acts as the linkage unit to conduct the
linkage of masked data received from the two database owners, while in two-party
Fig. 2 Outline of PPRL protocols with (left) and without (right) a linkage unit (also known as
three-party and two-party protocols, respectively)
protocols only the two database owners participate in the PPRL process. A conceptual
illustration and the main communication steps involved in these protocols are shown
in Fig. 2.
A further characterization of PPRL techniques is whether they allow the linking of data
from more than two data sources (multi-party) or not. Multi-party PPRL techniques
identify matching record sets (instead of record pairs) from all parties (more than
two) involved in a linkage, or from sub-sets of parties. Only limited work has so
far been conducted on multi-party PPRL due to its increased challenges, as we will
describe in Sect. 5. Similar to linking two databases, multi-party PPRL may or may
not use a linkage unit to perform the linkage.
Protocols that do not require a linkage unit are more secure in terms of collusion
(described below) between one of the database owners and the linkage unit. However,
they generally require more complex techniques to ensure that the database owners
cannot infer any sensitive information about each other’s data during the linkage
process.
3.2 Adversary Models
Different adversary models are assumed in PPRL techniques, including the most
commonly used honest-but-curious (HBC) and malicious models [164].
1. Honest-but-curious (HBC) or semi-honest parties are curious in that they try
to find out as much as possible about another party’s input to a protocol while
following the protocol steps [65,111]. If all parties involved in a PPRL protocol
have no new knowledge at the end of the protocol above what they would have
learned from the output, which is generally the record pairs (certain attributes)
classified as matches, then the protocol is considered to be secure in the HBC
model. However, it is important to note that HBC does not prevent parties from
colluding with each other with the aim to learn about another party’s sensitive
information [111]. Most of the existing PPRL solutions assume the HBC adver-
sary model.
2. Malicious parties can behave arbitrarily in terms of refusing to participate in
a protocol, not following the protocol in the specified way, choosing arbitrary
values for their data input, or aborting the protocol at any time [110]. Limited
work has been done in PPRL for the malicious adversary model [57,105,118].
Evaluating privacy under this model is very difficult, because there exist poten-
tially unpredictable ways for malicious parties to deviate from the protocol that
cannot be identified by an observer [21,62,111].
3. Covert and accountable computing models are advanced adversary models
developed to overcome the problems associated with the HBC and malicious
models. The HBC model is not sufficient in many real-world applications because
it is suitable only when the parties essentially trust each other. On the other hand,
the solutions that can be used with malicious adversaries are generally more com-
plex and have high computation and communication complexities, making their
applications not scalable to large databases. The covert model guarantees that
the honest parties can detect the misbehavior of an adversary with high prob-
ability [4], while the accountable computing model provides accountability for
privacy compromises by the adversaries without the excessive complexity and cost
incurred with the malicious model [72]. Research is required towards transform-
ing existing HBC or malicious PPRL protocols into these models and proving
privacy of solutions under these models.
3.3 Attacks
The privacy attacks or vulnerabilities that a PPRL technique is susceptible to allow
theoretical and empirical analysis of the privacy guarantees provided by the PPRL
technique. The main privacy attacks of PPRL are:
1. Dictionary attack is possible with masking functions, where an adversary masks
a large list of known values using various existing masking functions until a
matching masked value is identified [164]. A keyed masking approach, such as
the Hashed Message Authentication Code (HMAC) [97], can be used to prevent
dictionary attacks. With HMAC the database owners exchange a secret code
(string) that is added to all database values before they are masked. Without
knowing the secret key, a dictionary attack is unlikely to be successful (see the
code sketch after this list).
2. Frequency attack is still possible even with a keyed masking approach [164],
where the frequency distribution of a set of masked values is matched with the
distribution of known unmasked values in order to infer the original values of the
masked values [112].
3. Cryptanalysis attack is a special category of frequency attack that is applicable
to Bloom filter-based data masking techniques. As Kuzu et al. [101] have shown,
depending upon certain parameters of Bloom filter masking, such as the number
of hash functions employed and the number of bits in a Bloom filter, using a
constraint satisfaction solver allows the iterative mapping of individual masked
values back to their original values.
4. Composition attack can be successful by combining knowledge from more than
one independent masked dataset to learn sensitive values of certain records [60].
An attack on distance-preserving perturbation techniques [155], for example,
allows the original values to be re-identified with a high level of confidence if
knowledge about mutual distances between values is available.
5. Collusion is another vulnerability associated with multi-party or three-party
PPRL techniques, where some of the parties involved in the protocol work together
to find out about another database owner’s data. For example, one or several data-
base owners collude with the linkage unit, or a sub-set of database owners collude
among themselves to learn about other parties’ data.
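To make the keyed masking mentioned under the dictionary attack above concrete, the following Python sketch masks QID values with HMAC-SHA256. The key value is a placeholder; in practice the secret key is exchanged only between the database owners, so an adversary cannot reproduce the masked values for a dictionary of known values:

```python
import hashlib
import hmac

SECRET_KEY = b'shared-secret-key'  # placeholder, known only to the database owners

def mask(value):
    """Keyed one-way masking of a QID value using HMAC-SHA256."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(mask('peter'))  # reproducible only with knowledge of the secret key
```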
3.4 Data Masking or Encoding
In PPRL, the linkage has to be conducted on a masked or encoded version of the QIDs
to preserve the privacy of entities. Data masking (encoding) transforms original data
in such a way that there exists a specific functional relationship between the original
data and the masked data [55]. Several data masking functions have been used to
preserve privacy while allowing the linkage. We categorize existing data masking
techniques into three: (1) auxiliary, (2) blocking, and (3) matching techniques. Aux-
iliary techniques are the ones used as helper functions in PPRL, while blocking and
matching categories are used for private blocking and matching (comparison and
classification), respectively. In the following we describe key techniques in each of
these three categories.
Auxiliary:
1. Pseudo random function (PRF) is a deterministic secure function that, when
given an n-bit seed k and an n-bit argument x, returns an n-bit string fk(x) such
that it is infeasible to distinguish fk(x) for different random k from a truly random
function [114]. In PPRL, PRFs have been used to generate random secret values
to be shared by a group of parties [57,122,151].
2. Reference values constructed either with random faked values, or values that for
example are taken from a public telephone directory, such as all unique surnames
and town names, have been used in several PPRL approaches [77,124,141,173].
Such lists of reference values can be used to calculate the distances or similarities
between QID values in terms of the distances or similarities between QID and
reference values.
3. Noise addition in the form of extra records or QID values that are added to the
databases to be linked is a data perturbation technique [86] that can be used to
overcome the problem of frequency attacks on PPRL protocols [44,100]. An
example is shown in Fig. 3. Adding extra records, however, incurs a cost of lower
linkage quality (due to false matches) and scalability (due to extra records that
need to be processed and linked) [79]. Section 4.1 discusses noise addition for
private blocking.
Fig. 3 An example of phonetic encoding, noise addition, and secure hash-encoding (adapted
from [164]). Values represented with dotted outlines are added noise to overcome frequency
attacks [165] (as will be discussed in detail in Sect. 4.1)
4. Differential privacy [50] has emerged as an alternative to random noise addition
in PPRL. Only the perturbed results (with noise) of a set of statistical queries
are disclosed to other parties, such that the probability of holding any property
on the results is approximately the same whether or not an individual value is
present in the database [50]. In recent times, differential privacy has been used in
statistical (e.g. counts or frequencies) microdata publication as well as in PPRL
[16,68,103].
Blocking:
1. Phonetic encoding, such as Soundex, NYSIIS or Double-Metaphone, groups
values together that have a similar pronunciation [23] in a one-to-many mapping,
as shown in Fig. 3. The main advantage of using a phonetic encoding for PPRL
is that it inherently provides privacy [79], reduces the number of comparisons,
and thus increases scalability [23], and supports approximate matching [23,79].
Two drawbacks of phonetic encodings are that they are language dependent
[126,146] and are vulnerable to frequency attacks [165]. Section 4.1 discusses
phonetic encoding-based blocking in more details.
2. Generalization techniques overcome the problem of frequency attacks on
records by generalizing records in such a way that re-identification of individual
records from the masked data is not feasible [106,115,152]. k-anonymity is a
widely used generalization technique for PPRL [75,77,118], where a database is
said to be k-anonymous if every combination of masked QID values (or blocking
key values) is shared by at least k records in the database [152]. Other gen-
eralization techniques include value generalization hierarchies [67], top-down
specialization [118], and binning [108,162].
Matching:
1. Secure hash-encoding is one of the first techniques used for PPRL [17,49,127].
The widely known Message Digest (like MD5) and Secure Hash Algorithms (like
SHA-1 and SHA-2) [143] are one-way hash-encoding functions [97,143] that
can be used to mask values into hash-codes (as shown in Fig. 3) such that having
access to only hash-codes will make it nearly impossible with current computing
technology to infer their original values. A major problem with this masking
technique is, however, that only exact matches can be found [49]. Even a single
864 D. Vatsalan et al.
character difference between a pair of original values will result in two completely
different hash-codes.
2. Statistical linkage key (SLK) is a derived variable generated from components
of QIDs. The SLK-581 was developed by the Australian Institute of Health
and Welfare (AIHW) to link records from the Home and Community Care
datasets [140]. The SLK-581 consists of the second and third letters of first
name, the second, third and fifth letters of surname, full date of birth, and sex.
Similarly, an SLK consisting of month and year of birth, sex, zipcode, and initial of
first name was used for linking the Belgian national cancer registry [157]. How-
ever, as a recent study has shown, such SLK-based masking provides limited
privacy protection and poor sensitivity [136] (see the code sketch after this list).
3. Embedding space embeds QID values into a multi-dimensional metric space
(such as Euclidean [16,141,173] or Hamming [81]) while preserving the dis-
tances between these values using a set of pivot values that span the multi-
dimensional space. A drawback with this approach is that it is often difficult to
determine a good dimension of the metric space and select suitable pivot values.
4. Encryption schemes, such as commutative [1] and homomorphic [92] encryp-
tion, are used in PPRL techniques to allow secure multi-party computation (SMC)
in such a way that at the end of the computation no party knows anything except
its own input and the final results of the computation [38,62,111]. The secure
set union, secure set intersection, and secure scalar product, are the most com-
monly used SMC techniques for PPRL [38,143]. A drawback of these crypto-
graphic encryption schemes for SMC, however, is that they are computationally
expensive.
5. Bloom filter is a bit vector data structure into which values are mapped by using
a set of hash functions. Bloom filters have been widely used in PPRL for private
matching of records as they provide a means of privacy assurance [46,47,76,
104,147,170], if effectively used [102]. We will discuss Bloom filter masking
in more detail in the following section.
6. Count-min sketches are probabilistic data structures (similar to Bloom filters)
that can be used to hash-map values along with their frequencies in a sub-linear
space [41]. Count-min sketches have been used in PPRL where the frequency of
occurrences of a matching pair/set also needs to be identified [80,139]. However,
these approaches only support exact matching of categorical values.
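The SLK-581 generation described in the statistical linkage key item above can be sketched in Python as follows. The concatenation order, the date format, and the padding of short names with '2' are our own assumptions and may differ from the official AIHW specification:

```python
def slk581(first_name, surname, dob, sex):
    """Derive a statistical linkage key from QID components."""
    def pick(name, positions):
        # positions are 1-based; short names are padded with '2' (assumed rule)
        return ''.join(name[p - 1] if p <= len(name) else '2'
                       for p in positions)
    return (pick(first_name.upper(), [2, 3]) +  # 2nd, 3rd letters of first name
            pick(surname.upper(), [2, 3, 5]) +  # 2nd, 3rd, 5th letters of surname
            dob + sex)                          # full date of birth and sex

print(slk581('Amelia', 'Millar', '06011974', 'F'))  # MEILA06011974F
```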
Other privacy aspects in a PPRL project are the secure generation and exchange
of public/private key pairs, employee confidentiality agreements to reduce internal
threats, as well as encrypted communication, secure connections, and secure servers
to reduce external threats.
3.5 Bloom Filters
Bloom filter encoding has been used as an efficient masking technique in a variety
of PPRL approaches [46,104,130,148,150,158,160]. A Bloom filter bi is a
bit vector data structure of length l bits where all bits are initially set to 0. k inde-
pendent hash functions, h1, h2, ..., hk, each with range 1, ..., l, are used to map
each of the elements s in a set S into the Bloom filter by setting the bit positions
h1(s), h2(s), ..., hk(s) to 1. The Bloom filter was originally proposed by Bloom [15]
for efficiently checking set membership [19]. Lai et al. [104] first adopted the con-
cept of using Bloom filters in PPRL for identifying exactly matching records across
multiple databases, as will be described in Sect. 5.2.
Schnell et al. [148] were the first to propose a method for approximate matching
in PPRL using Bloom filters. In their approach, the character q-grams (sub-strings of
length q) of QID values of each record in the databases to be linked are hash-mapped
into a Bloom filter using k independent hash functions. The resulting Bloom filters
are sent to a linkage unit that calculates the similarity between Bloom filters using a
set-based similarity function, such as the Dice-coefficient [28]. The Dice-coefficient
of two Bloom filters (b1 and b2) is calculated as:

    Dice_sim(b1, b2) = 2 × c / (x1 + x2),    (1)

where c is the number of common bit positions that are set to 1 in both Bloom filters
(common 1-bits), and xi is the number of bit positions set to 1 in bi (1-bits), i ∈ {1, 2}.
An example similarity calculation is illustrated in Fig. 4.
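The following Python sketch illustrates Schnell et al.'s q-gram based Bloom filter masking and the Dice-coefficient of Eq. (1) on a small scale. The Bloom filter length l = 30, the k = 2 hash functions, and the seeded SHA-256 hashing scheme are illustrative choices only, not the parameters of any published system:

```python
import hashlib

L_BITS = 30   # Bloom filter length l (illustrative)
K_HASH = 2    # number of hash functions k (illustrative)

def bloom_filter(value, q=2):
    """Hash-map the q-grams of a QID value into a Bloom filter."""
    bf = [0] * L_BITS
    for i in range(len(value) - q + 1):
        gram = value[i:i + q]
        for seed in range(K_HASH):
            digest = hashlib.sha256(f'{seed}:{gram}'.encode()).hexdigest()
            bf[int(digest, 16) % L_BITS] = 1
    return bf

def dice_sim(b1, b2):
    """Dice-coefficient of two Bloom filters, as in Eq. (1)."""
    c = sum(x & y for x, y in zip(b1, b2))  # number of common 1-bits
    x1, x2 = sum(b1), sum(b2)
    return 2 * c / (x1 + x2) if (x1 + x2) > 0 else 0.0

print(dice_sim(bloom_filter('peter'), bloom_filter('pete')))  # high similarity
```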
Bloom filters are susceptible to cryptanalysis attacks, as shown by Kuzu et
al. [101]. Using a constraint satisfaction solver, such attacks allow the iterative
mapping of individual hash-encoded values back to their original values depending
upon the number of hash functions employed and the length of a Bloom filter. Differ-
ent Bloom filter encoding methods have been proposed in the literature to overcome
such cryptanalysis attacks and improve linkage quality. Schnell et al.’s proposed
method of hash-mapping all QID values of a record into one composite Bloom filter
is known as Cryptographic Long-term Key (CLK) encoding [148].
Fig. 4 An example similarity (Dice-coefficient) calculation of Bloom filters for approximate
matching using Schnell et al.’s approach [147] (taken from [164])
Durham et al. [48] investigated composite Bloom filter encoding in detail by
first hash-mapping different attributes into attribute-level Bloom filters of differ-
ent lengths. These lengths depend upon the weights [54] of QID attributes that are
calculated using the discriminatory power of attributes in separating matches from
non-matches using a statistical approach. These attribute-level Bloom filters are then
combined into one record-level Bloom filter (known as RBF) by sampling bits from
each attribute-level Bloom filter. Vatsalan et al. [165] recently introduced a hybrid
method of CLK and RBF (known as CLKRBF) where the Bloom filter length is kept
the same (as in CLK) while using different numbers of hash functions to map
different attributes into the Bloom filter based on their weights (to improve matching
quality as in RBF).
Several non-linkage unit-based approaches have also been proposed for PPRL
using Bloom filter masking, where the database owners (without a linkage unit)
collaboratively (or distributively) calculate the similarity of Bloom filters [104,158,
160]. A recent work proposed novel Bloom filter-based masking techniques that allow
approximate matching of numerical data in PPRL [161]. Instead of hash-mapping q-
grams of a string, the proposed approaches hash-map a set of neighbouring numerical
values to allow approximate matching.
4 Scalability Techniques for PPRL
PPRL for Big Data needs to scale to very large data volumes of many millions of
records from multiple sources. As for standard record linkage, the main techniques
for high efficiency are to reduce the search space by blocking and filtering approaches
and to perform record linkage in parallel on many processors.
These three kinds of optimization are largely orthogonal so that they may be
combined to achieve maximal efficiency. Blocking is defined on selected attributes
(blocking keys) that may be different from the QID attributes used for comparison, for
example zip code. It partitions the records in a database into several blocks or clusters
such that comparison can be restricted to the records of the same block, for example
persons with the same zip code. Other blocking approaches like sorted neighborhood
work differently but are similar in spirit. Filtering is an optimization for the particular
comparison approach which optimizes the evaluation of a specific similarity measure
for a predefined similarity threshold to be met by matching records. It thus utilizes
different filtering or indexing techniques to eliminate pairs (or sets) of records that
cannot meet the similarity threshold for the selected similarity measures [28,43].
Such techniques can be applied for comparison within blocks, i.e., filtering could be
utilized in addition to blocking.
In the next two subsections, we discuss several proposed blocking and filtering
approaches for PPRL. We then briefly discuss parallel PPRL which has found only
limited attention so far. We will focus on PPRL with two data sources (multi-party
PPRL is discussed in Sect. 5). We will furthermore mostly assume the use of a
dedicated linkage unit (as shown on the left-hand side in Fig. 2) as well as the masking
of records using Bloom filters (as described in Sect. 3.5). Note that a linkage unit is
ideally suited for high performance as it requires minimal communication between
the database owners, and it can utilize a high performance cluster for parallel PPRL
as well as blocking and filtering techniques.
4.1 Blocking Techniques
Blocking aims at reducing the search space for linkage by avoiding the comparison
between every pair of records and its associated quadratic complexity. There are
numerous blocking strategies [28] that mostly group records into disjoint or overlap-
ping blocks such that only records within the same block need to be compared with
each other. Standard blocking uses the values of a so-called blocking key to partition
all records into disjoint blocks. The blocking key values (BKVs) may be the values
of a selected attribute or the result of a function on one or several attribute values
(e.g. the concatenation of the first two letters of last name and year of birth). Other
blocking approaches include canopy clustering that results in overlapping clusters
or blocks, and sorted neighborhood that sorts records according to a sorting key
and only compares neighboring records within a certain window [28]. Comparing
records only within the predetermined blocks may result in the loss of some matches
especially if some BKVs are incorrect or missing. To improve recall a multi-pass
blocking approach can be utilized, where records are blocked according to different
blocking keys, at the cost of an increased number of additional comparisons.
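A minimal Python sketch of standard blocking using the example blocking key mentioned above (the concatenation of the first two letters of last name and year of birth); the record layout and values are illustrative assumptions:

```python
from collections import defaultdict

def blocking_key(record):
    """BKV: first two letters of last name concatenated with year of birth."""
    return record['surname'][:2].lower() + record['dob'][-4:]

def build_blocks(records):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks  # only records within the same block are compared

records = [{'surname': 'Millar', 'dob': '06.01.1974'},
           {'surname': 'Miller', 'dob': '12.03.1974'},
           {'surname': 'Roberts', 'dob': '11.10.1979'}]
print({bkv: len(block) for bkv, block in build_blocks(records).items()})
# {'mi1974': 2, 'ro1979': 1}
```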
Blocking for PPRL is based on the known approaches for regular record linkage
but aims at improving privacy. A general approach with a central linkage unit is to
apply a previously agreed on blocking strategy by the database owners on the original
records. Then all records within the different blocks are masked (encoded), e.g. using
Bloom filters, and sent to the linkage unit. The linkage unit can then compare the
masked records block-wise with each other. In the following, we present selected
blocking approaches for PPRL and discuss results from a comparative evaluation of
different blocking schemes.
Phonetic Blocking: Blocking records based on their phonetic code is a widely used
technique in record linkage [28]. The basic idea is to encode the values of a blocking
key attribute (e.g. last name) with a phonetic function (as discussed in Sect.3.4) such
as Soundex or Metaphone [28]. All records with the same phonetic code, i.e. with a
similar pronunciation, are then assigned to the same block. The phonetic blocking
has been used in several PPRL approaches, in particular in [76,79]. Karakasidis et
al. in [76] use a multi-pass approach with both Soundex and Metaphone encodings to
achieve a good recall. Furthermore, they add fake records to the blocks for improved
privacy.
As discussed in Sect. 3.4, adding fake records improves privacy but adds overhead
in the form of extra comparisons between records and can reduce linkage quality due
to the introduction of false matches. A theoretical analysis of the impact of adding
fake records for Soundex-based blocking is presented in [79]. The authors study the
effect of fake records on the so-called relative information gain which is related to the
entropy measure. A high entropy within blocks caused by fake records introduces
a high uncertainty and thus a low information gain [165]. The authors also study
different methods to inject fake records to increase entropy. The most flexible of the
approaches is based on the concept of k-anonymity and adds as many fake records
as required to ensure that each block has at least k records. The approach typically
requires only the addition of a modest number of fake records; the choice of k also
supports finding a good trade-off between privacy and quality.
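The following Python sketch combines Soundex-based phonetic blocking with the k-anonymity style padding of small blocks by fake records discussed above. The Soundex variant is simplified, and the fake record generator is a stand-in for a realistic one:

```python
from collections import defaultdict

CODES = {c: d for d, chars in
         {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
          '4': 'l', '5': 'mn', '6': 'r'}.items() for c in chars}

def soundex(name):
    """Simplified Soundex: first letter plus up to three code digits."""
    name = name.lower()
    code, last = name[0].upper(), CODES.get(name[0], '')
    for ch in name[1:]:
        digit = CODES.get(ch, '')
        if digit and digit != last:
            code += digit
        if ch not in 'hw':  # h and w do not separate equal codes
            last = digit
    return (code + '000')[:4]

def phonetic_blocks(names, k=3):
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    for key, block in blocks.items():  # pad with fake records for privacy
        while len(block) < k:
            block.append('fake-' + key)
    return blocks

print(phonetic_blocks(['millar', 'myler', 'peter', 'smith', 'smyth']))
# 'millar' and 'myler' share block M460; small blocks receive fake records
```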
Blocking with Reference Values: An alternative to adding fake records for improv-
ing the privacy of blocking is the use of reference values (as discussed in Sect. 3.4).
The reference values can be used by the database owners for clustering their database
records. Comparison can then be restricted to the clusters (blocks) of the same or
similar reference records. Such an approach has been proposed in [77] based on k
nearest neighbor (kNN) clustering. This approach first clusters the reference values
identically at each database owner such that each cluster contains at least k reference
values to ensure k-anonymity; clustering is based on the Dice-coefficient similarity
between values. In the next step, each database owner assigns its records to the nearest
cluster based on the Dice-coefficient between records and reference values. Finally
each database owner sends its clusters (encoded reference values and records) to the
linkage unit which then compares the records between corresponding clusters.
An alternate proposal utilizes a local sorted neighborhood clustering (SNC-3P)
for improved performance in the blocking phase while retaining the use of reference
values and support for k-anonymity [159]. Each database owner sorts a shared set
of reference values and then inserts its records into the sorted list according to their
sorting key. From the sorted list of reference values and records the initial Sorted
Neighborhood Clusters (SNCs) are determined such that each cluster contains one
reference value and a set of database records. To ensure k-anonymity, the initial clus-
ters are merged into larger blocks containing at least k database records. This differs
from kNN clustering where k reference records are needed per cluster. The merging
of the initial clusters can be based on similarity or size constraints. The remaining
protocol with sending the encoded records to the linkage unit for comparison works
as for kNN clustering.
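A rough Python sketch of the local clustering step of SNC-3P is given below. It makes several simplifying assumptions: records and reference values are ordered by a plain alphabetical sorting key, and initial clusters are merged by size only (the approach described above also supports similarity-based merging):

```python
def snc_clusters(reference_values, records, key, k=2):
    """One initial cluster per reference value; clusters are merged
    until each resulting block holds at least k database records."""
    refs = sorted(reference_values)
    clusters = {ref: [] for ref in refs}
    for rec in records:
        # assign each record to the closest preceding reference value
        owner = max((r for r in refs if r <= key(rec)), default=refs[0])
        clusters[owner].append(rec)
    merged, current = [], []
    for ref in refs:  # merge initial clusters to ensure k-anonymity
        current += clusters[ref]
        if len(current) >= k:
            merged.append(current)
            current = []
    if current:  # attach any remainder to the last block
        if merged:
            merged[-1].extend(current)
        else:
            merged.append(current)
    return merged

refs = ['adams', 'jones', 'smith']
recs = ['amelia', 'jeff', 'millar', 'peter', 'sieman', 'smyth']
print(snc_clusters(refs, recs, key=lambda r: r, k=2))
# [['amelia', 'jeff'], ['millar', 'peter', 'sieman', 'smyth']]
```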
An adaptation of SNC-3P for two parties without a linkage unit (SNC-2P)was
presented in [163]. In this approach, the two database owners generate their reference
values independently, so that they end up with two different sets of reference values.
As for SNC-3P, each database owner sorts its reference values, inserts its records
into the sorted list, builds initial SNCs (with one reference value and its associated
records), and merges these clusters to guarantee k-anonymity. Afterwards the data-
base owners exchange their reference values. These values are then merged with
the local reference values and sorted. To find candidate pairs between the sources a
sorted neighborhood method with a sliding window w is applied on these reference
values. The window size w determines the number of reference values originating
from each data source, e.g. for w = 2 the sliding window includes 2 reference values
from each data source. In the last step, the encoded records falling into a window are
exchanged for comparison.
LSH-based blocking: Locality-sensitive hashing (LSH) has been proposed to solve
the problem of nearest neighbor search in high dimensional spaces [61,69]. For
blocking, LSH uses a family of hash functions to generate keys used to partition the
records in a database, so that similar records are grouped into the same block [89].
Durham investigated the use of LSH for private blocking of records masked as Bloom
filters [46]. She considered two families of hash functions depending on the used dis-
tance function (Jaccard or Hamming distance). For the Jaccard distance, she proposed
the use of MinHash functions. A MinHash function hi permutes the bits of a Bloom
filter bi and selects from the permuted Bloom filter the first index position with a set
bit (1-bit). By applying φ MinHash functions we obtain φ index positions which are
concatenated to generate the final MinHash key. So for Bloom filter bi and the family
of hash functions H, we determine key(bi)_H = concat(h1(bi), h2(bi), ..., hφ(bi)),
where hj ∈ H with 1 ≤ j ≤ φ, and the function concat() concatenates a set of input
values. For the Hamming distance, Durham proposed the use of HLSH hash functions
that select the bit value of a Bloom filter at a random position ρ. In the same way as
MinHash, φ HLSH functions are applied on a Bloom filter bi and the values of the
φ selected bits are concatenated to obtain the final hash key of bi.
Example Consider two Bloom filters b1 = 1100100011 and b2 = 1100100111,
two permutations p1 = (10, 7, 1, 3, 6, 4, 2, 9, 5, 8) and p2 = (4, 3, 6, 7, 2, 1, 5,
10, 9, 8), and the MinHash family H1 with two functions h1 = Min(p1(·)) and
h2 = Min(p2(·)), where Min(·) returns the first position of a set bit in the input
bit vector, and pi(·) returns the input bit vector permuted using pi. The applica-
tion of h1 and h2 on b1 results in h1(b1) = Min(p1(b1)) = Min(1010001110) = 1
and h2(b1) = Min(p2(b1)) = Min(0000111110) = 5. Hence the key of b1 in H1
is key(b1)_H1 = concat(h1(b1), h2(b1)) = (1, 5). In the same way we determine the
key of b2, i.e. key(b2)_H1 = (1, 5). Hence, for MinHash family H1, records b1 and b2
are put into the same block and will be compared with each other.
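The following Python sketch reproduces this example; representing each permutation as a list of 1-based source positions is our own encoding choice:

```python
def minhash_key(bf, permutations):
    """Concatenate, for each permutation, the first 1-bit position
    of the permuted Bloom filter (positions are 1-based)."""
    key = []
    for perm in permutations:
        permuted = [bf[p - 1] for p in perm]
        key.append(permuted.index(1) + 1)  # assumes at least one 1-bit
    return tuple(key)

b1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]   # 1100100011
b2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]   # 1100100111
p1 = [10, 7, 1, 3, 6, 4, 2, 9, 5, 8]
p2 = [4, 3, 6, 7, 2, 1, 5, 10, 9, 8]
print(minhash_key(b1, [p1, p2]))  # (1, 5)
print(minhash_key(b2, [p1, p2]))  # (1, 5) -> same block as b1
```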
Both families, MinHash and HLSH, depend on two parameters: the number of
hash functions φ as well as the number of passes or iterations μ. Since the final hash
key of a record concatenates φ values, using a high φ leads to more homogeneous
blocks and better precision (i.e., blocks containing similar records with higher prob-
ability). However, a high φ also increases the probability of missing true matches
(reduced recall). This problem is addressed by applying μ iterations, each with a
different set of hash functions. Therefore each record bi will be hashed to several
blocks to allow identifying more true matches.
In [83] the authors present a theoretical analysis of the use of MinHash functions
to identify good values for the parameters φopt and μopt to efficiently achieve good
precision and recall. The naïve approach to improve recall is to increase the number
of iterations μ and thus the number of blocks to which records are assigned. The
drawbacks of this method are the high runtime caused by the computation of the
permutations, increased number of record pairs to compare, and the large space
needed to store intermediate results. This observation was experimentally confirmed
in [46]. The choice of φ_opt is also complex and depends on the expected running
time (for details see [83]).
Evaluation of Private Blocking Approaches: The relatively large number of pos-
sible blocking approaches requires detailed evaluations regarding their relative scal-
ability, blocking quality and privacy for different kinds of workloads. One of the few
studies in this respect has been presented by Vatsalan et al. [165]. For scalability
they considered runtime and the so-called reduction ratio (RR), a value indicating
the number of pruned candidate pairs compared to all possible record pairs (which
thus evaluates to what degree the search space is reduced). For blocking quality they
considered the recall and precision metrics pair completeness (PC) and pair quality
(PQ), respectively [29]. For privacy they estimated the so-called disclosure risk (DR)
measures, which represent the probability that masked records/QID values can be
linked with records or values in a publicly available dataset.
The evaluation of [165] considers six simulated blocking strategies including
kNN [77], SNC-3P [159], SNC-2P [163] and LSH blocking [46]. Regarding blocking
runtime, the SNC and LSH schemes performed best. All strategies except SNC-2P
achieved a very high RR of almost 1. On the other hand, SNC-2P achieved the best
PC. The best trade-off between RR and PC was observed for LSH. Considering the
privacy aspect, SNC-2P was found to have a low DR while kNN and LSH generally
expose the highest DR.
While this study provides interesting results, we see a need for additional bench-
mark studies given that further blocking schemes have been developed more recently
and that the relative behavior of each approach depends on numerous parameter set-
tings as well as characteristics of the chosen datasets.
4.2 Filtering Techniques
Almost all proposed PPRL schemes based on Bloom filters aim at identifying
the pairs of bit vectors with a similarity above a threshold. For regular record
linkage, such a threshold-based comparison of record pairs is known as a sim-
ilarity join [39]. The efficient processing of such similarity joins for different
kinds of similarity measures has been the focus of much research in the past, e.g.
[2,64,138,172]. Several approaches utilize the characteristics of the considered
similarity measure and the prespecified similarity threshold to reduce the search
space thereby speeding up the linkage process. This holds especially for the broad
class of token-based similarity joins where the comparison of records is based on the
set of tokens (e.g. q-grams) of QIDs. In this case, one can exclude all pairs of records
that do not share at least one token. Further proposed optimizations for such similar-
ity joins include the use of different kinds of filters (for example, length and prefix
filters) and dynamically created inverted indexes [10]. PPJoin [172] is an efficient
approach that includes these kinds of optimizations. Several filtering approaches
also utilize the characteristics of similarity functions for metric spaces to reduce the
search space, in particular the so-called triangle inequality (see below) [174].
For PPRL, similar filtering (similarity join) approaches are usable but need to be
adapted to the comparison of masked records such as Bloom filters and the associated
similarity measures. For Bloom filters it is relatively easy to apply the known token-
based similarity measures by considering the set bit positions (1-bits) in the bit vectors
as the “tokens”. This has already been shown in Sect. 3.5 for the Dice-coefficient
similarity, which is based on the degree of overlapping bit positions. This is also the
case for the related Jaccard similarity. For two bit vectors b1 and b2 it is defined as
follows:

Jacc_sim(b1, b2) = |b1 ∧ b2| / |b1 ∨ b2| = |b1 ∧ b2| / (|b1| + |b2| − |b1 ∧ b2|),   (2)

where |bi| denotes the number of set bits in bit vector bi, which is also called its
length or cardinality. For the example Bloom filter pair shown in Fig. 4, the Jaccard
similarity is 5/7 = 0.71. In the following, we outline several filtering approaches
that have been proposed for PPRL.
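As a minimal sketch (ours) of Eq. 2, the following Python snippet computes the Jaccard similarity of two Bloom filters represented as integers; the bit vectors are chosen so that, as in the Fig. 4 example, the pair has lengths 5 and 7, five common 1-bits, and a Jaccard similarity of 5/7 ≈ 0.71:

```python
# Jaccard similarity on Bloom filters (Eq. 2): the set bit positions
# play the role of the "tokens" of a token-based similarity measure.

def jaccard(b1: int, b2: int) -> float:
    common = bin(b1 & b2).count('1')                        # |b1 AND b2|
    union = bin(b1).count('1') + bin(b2).count('1') - common
    return common / union if union else 0.0

b1 = 0b1111100   # 5 set bits, all contained in b2
b2 = 0b1111111   # 7 set bits
print(round(jaccard(b1, b2), 2))   # 0.71
```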
Length Filter: The similarity function Jacc_sim (as well as Dice_sim) allows the
application of a simple length filter to reduce the search space. This is because the
minimal similarity (overlap of set bits) can only be achieved if the lengths (number of
set bits) of the two input records do not deviate too much. Formally, for two records
ri and rj with |ri| ≤ |rj|, it holds that

Jacc_sim(ri, rj) ≥ st ⇒ |ri| ≥ st · |rj|   (3)

For example, two records cannot satisfy a (Jaccard) similarity threshold st = 0.8 if
their lengths differ by more than 20%. Hence, for a similarity threshold of 0.8, the
length filter would already exclude the example Bloom filter pair shown in Fig. 4
without a detailed comparison, since Eq. 3 (5 ≥ 0.8 · 7 = 5.6) does not hold. The
two-party PPRL approach proposed by Vatsalan and Christen uses such a length
filter for the Dice-coefficient similarity [158].
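The following sketch (ours) applies Eq. 3 over a set of Bloom filters sorted by their number of set bits; once a longer record violates the bound, all even longer ones can be skipped. With st = 0.8 the length-5/length-7 pair from above would already be pruned, so the example uses st = 0.7:

```python
# Length filter (Eq. 3): only pairs whose set-bit counts do not deviate
# too much survive as candidates for a detailed comparison.

def length_filter_pairs(blooms, s_t):
    """Yield candidate pairs that satisfy the Jaccard length filter."""
    by_length = sorted((bin(b).count('1'), b) for b in blooms)
    for i, (len_i, b_i) in enumerate(by_length):
        for len_j, b_j in by_length[i + 1:]:
            if len_i < s_t * len_j:  # Eq. 3 violated; longer records only get worse
                break
            yield b_i, b_j

blooms = [0b1111100, 0b1111111, 0b1000000]
print(list(length_filter_pairs(blooms, 0.7)))   # only the (5, 7) pair survives
```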
PPJoin for PPRL: The privacy-preserving version of PPJoin (called P4Join) utilizes
three filters to reduce the search space: the length filter as well as a prefix filter and
a position filter [150]. The prefix filter is based on the fact that matching bit vectors
need a high degree of overlap in their set bit positions in order to satisfy a predefined
threshold. Pairs of records can thus be excluded from comparison if they have an
insufficient overlap. This overlap test can be limited to a relatively small sub-set
of bit positions, e.g. in the beginning (or prefix) of the vectors. To maximize this
filter idea, P4Join applies a pre-processing to count for each bit position the number
of records where this bit is set to 1 and reorders the positions of all bit vectors
in ascending order of these frequency counts. This way the prefixes of bit vectors
contain infrequently set bit positions reducing the likelihood of an overlap with other
bit vectors. The position filter of P4Join can avoid the comparison of two records
even if their prefixes overlap depending on the prefix positions where the overlap
occurs. For more details of this filter we refer to [150].
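The frequency-based reordering at the heart of the prefix filter can be sketched as follows (our illustration; P4Join itself combines the reordered vectors with the length and position filters [150]):

```python
# Reorder all bit positions in ascending order of their 1-frequency, so
# that rarely set positions move to the front (the prefix) of every bit
# vector, which makes prefix overlaps between dissimilar records unlikely.

def reorder_by_frequency(blooms):
    num_pos = len(blooms[0])
    counts = [sum(b[pos] for b in blooms) for pos in range(num_pos)]
    order = sorted(range(num_pos), key=counts.__getitem__)
    return [[b[pos] for pos in order] for b in blooms]

blooms = [[1, 0, 1, 1],
          [1, 1, 0, 1],
          [1, 0, 0, 1]]
print(reorder_by_frequency(blooms))  # infrequently set positions come first
```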
As we will see in the comparative evaluation below, the filtering approaches
achieve only a relatively small improvement for PPRL since the filter tests already
imply a certain overhead that is not much smaller than that of the match tests (which
are relatively cheap for Bloom filters). In addition, Bloom filter masking for PPRL
should ideally set about 50% of the bits to 1 in order to make the Bloom filters less
vulnerable to frequency attacks [117], making P4Join less effective.
Multi-bit Trees: The use of multi-bit trees was originally proposed for fast similarity
search in large databases of chemical fingerprints (masked into Bloom filters) [98].
A query Bloom filter bq is searched for in a database to retrieve all elements
whose similarity with bq is above the threshold st. A multi-bit tree is a binary tree that
iteratively assigns fingerprints to its nodes based on so-called match bits. A match
bit refers to a specific position of the bit vector and can be 1 or 0; it indicates that all
fingerprints in the associated subtree share the specified match bit. When building
up the multi-bit tree, one or several match bits are selected in each step so that the
set of unassigned fingerprints is split roughly in half. The splitting continues as long
as the number of fingerprints per node does not fall below a limit ([98] recommends
a limit of 6). For a query fingerprint, the match bits can then be used to determine
the maximal possible similarity of subtrees when traversing the tree, thereby
eliminating many fingerprints from the comparison.
As suggested in [5], multi-bit trees can easily be applied for PPRL using Bloom
filters and Jaccard similarity. For two datasets, the larger input dataset is used to build
the multi-bit trees while each record (fingerprint) of the second dataset is used for
searching similar records. The multi-bit approach of [5] partitions the fingerprints
according to their lengths such that all fingerprints with the same length belong to the
same partition (or bucket). To apply the length filter, we can then restrict the search
for similar fingerprints to the partitions meeting the length criterion of Eq. 3. Query
efficiency is further improved by organizing all fingerprints of a partition within a
multi-bit tree.
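A sketch (ours) of this length-partitioned search, without the multi-bit trees inside the partitions, may look as follows; Eq. 3 bounds the admissible partition lengths for a query:

```python
import math
from collections import defaultdict

def build_partitions(fingerprints):
    """Bucket fingerprints by their number of set bits (their 'length')."""
    parts = defaultdict(list)
    for f in fingerprints:
        parts[bin(f).count('1')].append(f)
    return parts

def jaccard(b1, b2):
    union = bin(b1 | b2).count('1')
    return bin(b1 & b2).count('1') / union if union else 0.0

def search(parts, query, s_t):
    """All fingerprints with Jaccard similarity >= s_t to the query;
    only partitions satisfying the length filter of Eq. 3 are probed."""
    q_len = bin(query).count('1')
    lo, hi = math.ceil(s_t * q_len), math.floor(q_len / s_t)
    return [f for length in range(lo, hi + 1)
              for f in parts.get(length, [])
              if jaccard(query, f) >= s_t]

parts = build_partitions([0b1111100, 0b1111111, 0b1000000])
print(search(parts, 0b1111110, 0.7))   # the length-1 partition is never probed
```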
In evaluations of [5,145] the multi-bit tree approach was found to be very effective,
performing better than or similarly to blocking approaches such as canopy clustering
and sorted neighborhood.
PPRL for Metric Space Similarity Measures: A metric space consists of a set of
data objects and a metric to compute the distance between the objects. The main
property of interest that a metric or distance function d for metric spaces has to
satisfy is the so-called triangle inequality. It requires that for any objects x, y and z
it holds that

d(x, z) ≤ d(x, y) + d(y, z)   (4)
Fig. 5 Triangle inequality: object y cannot lie within the search radius rad(q) of query
object q since the difference between d(p, q) and d(p, y) exceeds rad(q) (taken from [149])
Distance functions for metric spaces satisfying this property include the
Minkowski distances (for example, Euclidean distance), edit distance, Hamming
distance and the Jaccard-coefficient (but not the Dice-coefficient) [174]. The triangle
inequality has been used for private comparison and classification in PPRL using
reference values [124,162].
The triangle inequality has also been used to reduce the search space for similarity
search and record linkage [7,12]. In both cases we have to find for a query object
q those similar objects x with a distance d(q, x) lower than or equal to a maximal
distance threshold (or above a minimal similarity threshold), which can be seen as a
radius rad(q) around q in Fig. 5. The triangle inequality allows one to avoid computing
the distance between two objects based on their precomputed distances to a third
reference object or pivot, such as object p in Fig. 5. Utilizing the precomputed
distances d(p, q) and d(p, x), we only have to compute the distance d(q, x) for objects
x that satisfy |d(p, q) − d(p, x)| ≤ rad(q). In all other cases the comparison can be
avoided, such as for object y in Fig. 5.
Several alternatives to utilize the triangle inequality to reduce the search space for
PPRL have been studied in [149], in particular for the Hamming distance, which has
been shown to be equivalent to the Jaccard similarity [172]. The best performance
was achieved for a pivot-based approach that selects a certain number of data objects
from a sample of the first dataset as pivots and assigns every other object of the
first dataset to its closest pivot. For each pivot, the maximal distance (radius) of
its objects is also recorded. Pivots are iteratively determined from the sample set of
objects such that the object with the greatest distance to all previously determined
pivots becomes the next pivot. The rationale behind this selection strategy is to have
a relatively large distance between pivots so that searching for similar objects can be
restricted to the objects of relatively few pivots. Determining the pivots from a sample
rather than from all objects limits the overhead of pivot selection. The search for
similar (matching) objects can be restricted to the pivots whose radius can possibly
overlap with the radius of the query objects. For the objects of the relevant pivots, the
triangle inequality is further used to prune objects from the comparison.
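The following Python sketch (ours) illustrates this pivot-based pruning under simplifying assumptions: the Hamming distance serves as the metric, pivots are selected greedily from a sample, and both the per-cluster and the per-object pruning rules follow the description above:

```python
def dist(a, b):
    return bin(a ^ b).count('1')   # Hamming distance is a metric

def select_pivots(sample, k):
    """Greedily pick k pivots with large distances to each other."""
    pivots = [sample[0]]
    while len(pivots) < k:
        pivots.append(max(sample, key=lambda o: min(dist(o, p) for p in pivots)))
    return pivots

def assign(objects, pivots):
    """Assign each object to its closest pivot, keeping d(p, o)."""
    clusters = {p: [] for p in pivots}
    for o in objects:
        p = min(pivots, key=lambda piv: dist(o, piv))
        clusters[p].append((o, dist(o, p)))
    return clusters

def query(clusters, q, radius):
    """All objects within 'radius' of q, pruned via the triangle inequality."""
    result = []
    for p, members in clusters.items():
        d_pq = dist(p, q)
        if d_pq - max((d for _, d in members), default=0) > radius:
            continue   # the whole cluster lies outside the search radius
        for o, d_po in members:
            # |d(p,q) - d(p,o)| > radius implies d(q,o) > radius
            if abs(d_pq - d_po) <= radius and dist(q, o) <= radius:
                result.append(o)
    return result

objects = [0b101100, 0b101110, 0b010001, 0b111111]
clusters = assign(objects, select_pivots(objects, 2))
print(query(clusters, 0b101100, 1))   # the query object and its 1-bit neighbor
```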
Comparative Evaluation: The performance of pivot-based approaches for metric
similarity measures has been evaluated in [149] and compared with the use of P4Join
and multi-bit trees. The evaluation has been done for synthetically generated datasets
Table 3 PPRL runtime in minutes for different dataset sizes and filtering approaches (taken
from [149])

Algorithm               100,000   200,000   300,000   400,000   500,000
NestedLoop                  3.8      20.8      52.1      96.8     152.6
Multi-bit Tree              2.6      11.3      26.5      50.0      75.9
P4Join                      1.4       7.4      24.1      52.3      87.9
Pivots (Metric Space)       0.2       0.4       0.9       1.3       1.7
of 100,000–500,000 records such that 80% of the records are in the first dataset and
20% in the second. Bloom filters of length 1,000 bits are used to mask the QIDs of
records, and the comparison is based on a Jaccard similarity threshold of 0.8 or the
corresponding Hamming distance for the metric-space approach.
Table 3 summarizes the runtimes of the different approaches as well as for a naïve
nested loop approach without any filtering (all implemented using Java) on a standard
PC (Intel i7-4770, 3.4 GHz CPU with 16 GB main memory). The results show that
both multi-bit trees and P4Join perform similarly but achieve only modest improve-
ments (less than a factor of 2) compared to the naïve nested loop scheme. By contrast
the pivot-based metric space approach achieves order-of-magnitude improvements.
For the largest dataset it only needs 1.7 min and is 40 times faster than using multi-bit
trees. A general observation for all approaches is that the runtimes increase more than
linearly (almost quadratically) with the size of datasets, indicating a potential scala-
bility problem despite the reduction of the search space. Substantially larger datasets
would thus create performance problems even for the best filtering approach, indicating
that additional runtime improvements are necessary, e.g. by the use of parallel
PPRL.
4.3 Parallel PPRL
PPRL in Big Data applications involves the comparison of a large number of masked
records as the main part of the overall execution pipeline. Parallel linkage on multiple
processors aims at improving the execution time proportionally to the number of
processors [42,90,91]. This can be achieved by partitioning the set of all record
pairs to be compared, and conducting the comparison of the different partitions in
parallel on different processors. A special case would be to utilize a blocking approach
to compare the records in different blocks in parallel. In the following we discuss
two approaches for parallel PPRL that have been proposed: one utilizes graphics
processors or GPUs for parallel processing within a single machine, and the other
one is based on Hadoop and its MapReduce framework. Both approaches have also
been used for general record linkage.
Parallel PPRL with GPUs: The utilization of Graphical Processing Units (GPUs) to
speed-up similarity computations is a comparatively new approach [56,120]. Modern
GPUs provide thousands of cores that allow for a massively-parallel application of
the same instruction set to disjoint data partitions. The availability of frameworks like
OpenCL and CUDA simplify the utilization of GPUs to parallelize general purpose
algorithms. The GPU programs (called kernels) are typically written in a dialect of
the C programming language. Kernel execution requires the input and output
data to be transferred between the main memory of the host system and the memory
of the GPU, and it is important to minimize the amount of data to be transferred.
Further limitations are that there is no dynamic memory allocation on GPUs (all
resources required by a program need to be allocated a priori) and that only basic
data types (e.g., int, long, float) and fixed-length data structures (e.g., arrays) can be
used.
Despite such limitations, the utilization of GPUs is a promising approach to speed-
up PPRL. This is especially the case for Bloom filter masking where all records are
represented as bit vectors of equal length. These vectors can easily be stored in array
data structures on the GPU. Furthermore, similarity computations can be broken
down into simple bit operations which are easily processed by GPUs.
A GPU-based implementation for PPRL using the P4Join filtering is described
in [150]. It sorts the bit vectors of the two input datasets initially according to their
number of set bits (1-bits) and partitions the set of bit vectors into equal-sized blocks
such that multiple of such blocks fit into the GPU memory. Pairs of blocks are then
continuously loaded into the GPU for parallel comparison. To limit unnecessary data
transfers, the length filter (described in Sect. 4.2) is applied to avoid transferring pairs
of blocks that do not meet the length filter restriction. The kernel programs also apply
the prefix filter to save comparisons.
The evaluation in [150] showed that the GPU implementation is highly efficient
and improves runtimes by a factor of 20, even for a low-profile graphics card (Nvidia
GeForce GT 540M with 96 CUDA cores @ 672 MHz, 1 GB memory). It would be
interesting to realize GPU versions of other PPRL approaches and to utilize more
powerful graphics cards with thousands of cores for improved performance.
Hadoop-based Parallel PPRL: Many Big Data applications are based on local
Shared Nothing clusters driven by software from the open-source Hadoop ecosystem
for parallel processing. Depending on the data volume and the needed degree of
parallelism, up to thousands of multi-processor nodes are utilized. A main reason for
the success of Hadoop is that its programming frameworks, in particular MapReduce
and newer platforms such as Apache Spark (http://spark.apache.org) or Apache Flink
(https://flink.apache.org/), make it easy to develop programs that can be automatically
executed in parallel on Hadoop clusters.
Several approaches and implementations have utilized MapReduce for parallel
record linkage [93,166]. In its simplest form, the Map tasks read the input data in
parallel and apply a blocking key to assign each record to a block. Then the data
records are dynamically redistributed among the Reduce tasks such that all records
Fig. 6 Parallel PPRL with MapReduce using LSH blocking [82], where the MinHash keys for
Bloom filters are computed in the Map phase and records with the same MinHash signature will be
sent to the same Reduce task for matching
with the same blocking key are sent to the same Reduce task. Comparison is then
performed block-wise and in parallel by the Reduce tasks. For highly skewed block
sizes this simple approach can result in load balancing problems for the Reduce tasks;
approaches to solve this data skew or load balancing problem are proposed in [94].
The sketched approach can in principle also be applied for parallelizing PPRL,
e.g., if the linkage unit utilizes a Hadoop cluster. One such approach for using MapReduce
to speed up PPRL has been proposed in [82]. The authors apply an LSH-based
blocking method using the MinHash approach (see Sect. 4.1). The use of MapReduce
is rather straightforward and illustrated in Fig. 6. The Bloom filters of both sources
are initially stored in the distributed file system (HDFS) as chunks. In the Map phase,
records are read sequentially and for each Bloom filter r a set of MinHash keys is
computed. The records are then redistributed so that records with the same key are
sent to the same Reduce task for comparison. The main drawback of this strategy is
that records may be compared several times by different Reduce tasks because they
could share many keys (as shown in Fig. 6 for records r2 and s1). To overcome this
problem the authors proposed another strategy that chains two MapReduce jobs,
where the first one is similar to the described method except that the Reduce phase
only outputs the pairs of record identifiers instead of comparing the records. In the
second MapReduce job, duplicate record pairs are grouped at the same Reducer to
be compared only once. In this process, the Bloom filters themselves are not
redistributed (only their identifiers); instead, the Bloom filters are stored in a relational
database from where they are read when needed. The evaluation of this parallel LSH
approach in [82] was
limited to only 2 and 4 nodes and small datasets (about 300,000 records) so that the
overall scalability of the approach remains unclear.
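The two-job strategy can be simulated with plain Python data structures to show the data flow (a sketch under our own naming, not the implementation of [82]; lsh_keys stands for any MinHash key generator such as the one in Sect. 4.1):

```python
from collections import defaultdict
from itertools import combinations

def job1_candidate_pairs(records, lsh_keys):
    """records: {record_id: bloom}. Map: compute the LSH keys per record;
    Reduce: emit the id pairs of each block (no comparisons yet)."""
    blocks = defaultdict(list)
    for rid, bloom in records.items():
        for key in lsh_keys(bloom):
            blocks[key].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

def job2_unique_pairs(pairs):
    """Group duplicate pairs so that each pair is compared exactly once."""
    return set(pairs)

records = {'r1': '1100', 'r2': '1100', 's1': '1101'}
toy_keys = lambda b: [b[:2], b[2:]]   # toy stand-in for MinHash keys
print(job2_unique_pairs(job1_candidate_pairs(records, toy_keys)))
```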
For future work, it would be valuable to investigate and compare different parallel
PPRL approaches utilizing the Hadoop ecosystem. Such approaches could also utilize
the Spark or Flink frameworks, which support more operators than just Map and
Reduce as well as efficient distributed in-memory processing.
5 Multi-party PPRL
While there have been many different approaches proposed for PPRL [164], most
work thus far has concentrated on linking records from only two databases (or par-
ties). Only some approaches have investigated linking records from three or more
databases [75,104,118,122,127,130,131,160], with most of these being limited to
exact matching or matching of categorical data only, as will be discussed below. How-
ever, as the example applications described in Sect. 1 have shown, linking data from
multiple databases is increasingly being required for several Big Data applications.
In the following, we describe existing techniques of multi-party private blocking and
private comparison and classification for multi-party PPRL (MP-PPRL).
5.1 Multi-party Private Blocking Techniques
Private blocking for MP-PPRL is crucial due to the exponential growth of the com-
parison space with the number of databases linked. However, this has not been studied
until recently, which has made MP-PPRL impractical for real applications.
Tree-based approaches: The first approach [130] is based on a single-bit tree
(adapted from multi-bit tree [98]) data structure, which is constructed iteratively
to arrange records (masked into Bloom filters) such that similar records are placed
into the same tree leaf while non-similar records are placed into different leaf nodes
in the tree. At each iteration, the set of Bloom filters in a tree node is recursively
split based on selected (according to a privacy criteria) bit positions which are agreed
upon by all parties. A drawback with this approach, however, is that it might miss
true matches due to the recursive splitting of Bloom filters. Furthermore, a commu-
nication step is required among all parties for each iteration.
This limitation of missing true matches in the single-bit tree-based approach [130]
has been addressed in [131] using a multi-bit tree [98] data structure (as we discussed
in Sect. 4.2) that is combined with canopy clustering. Multi-bit tree-based filtering for
PPRL of two databases was first introduced by Schnell [144]. In [131] the concept of
multi-bit trees was used to split the databases (masked into Bloom filters) individually
by the parties into small mini-blocks, which are then merged into larger blocks
according to privacy and computational requirements based on their similarity using
a canopy clustering technique [40].

Fig. 7 Multi-party private blocking approach as proposed by Ranbaduge et al. [133] (adapted
from [133]). Candidate block sets from all three parties (CA1, CB1, CC1) and a sub-set of two
parties (CA2, CB2) are identified to be compared and classified
Linkage unit-based approaches: A communication-efficient approach for
multi-party private blocking by using a linkage unit was recently proposed [133],
as illustrated in Fig. 7. In the first step of this approach, local blocks are generated
individually by each party using a private blocking technique (which is considered
to be a black box). For example, the private blocking approach based on multi-bit
tree and canopy clustering [131] (described above) can be used for local blocking.
A block representative in the form of a MinHash signature [18] is then generated
for each block and sent to a linkage unit. The linkage unit applies global blocking
using locality sensitive hashing (LSH) to identify the candidate block sets from all
parties or from sub-sets of parties based on the similarity between block representatives.
tives. Local blocking provides the database owners with more flexibility and control
over their blocks while eliminating all communications among them. This approach
outperforms existing multi-party private blocking approaches in terms of scalability,
privacy, and blocking quality, as validated by a set of experiments conducted by the
authors [133].
Karapiperis and Verykios recently proposed a multi-party private blocking
approach based on LSH [84]. This approach uses L independent hash tables (or
blocking groups), each of which consists of key-bucket pairs where keys represent
the blocking keys and buckets host a linked list aimed at grouping similar records that
were previously masked into Bloom filters. Each hash table is assigned a set of K
hash functions, which is generated by a linkage unit and sent to all the database owners
to populate their sets of blocks accordingly. The same authors extended this approach
by proposing a frequent pairs scheme (FPS) [85] for further reducing the number of
comparisons while maintaining a high level of recall. This approach achieves high
blocking quality by identifying similar record pairs that exhibit a number of LSH
collisions above a given threshold, and then performs distance calculations only for
those similar pairs. Empirical results showed significant improvement in running
time due to a drastic reduction of candidate pairs by the FPS, while achieving high
blocking quality [85].
A major drawback of these multi-party private blocking techniques is that they
still result in an exponential comparison space with an increasing number of databases
to be linked, especially when the databases are large. Therefore, efficient
communication patterns, such as ring-based or tree-based [113,142], as well as
advanced filtering techniques, such as those discussed in Sect. 4.2, need to be
investigated for multi-party PPRL in order to make PPRL scalable and viable in Big Data
applications.
5.2 Multi-party Private Comparison and Classification
Techniques
Several private comparison and classification techniques for MP-PPRL have been
developed in the literature. However, they fall short of providing a practical solution,
either because they allow exact matching only or because they are computationally
not feasible for the size and number of multiple databases. In the following we describe these
approaches and their drawbacks.
Secure Multi-party Computation (SMC)-based approach: An approach based
on SMC using an oblivious transfer protocol was proposed in [122] for multi-party
private comparison and classification. While provably secure, the approach only
performs exact matching of masked records and it is computationally expensive
compared to efficient perturbation-based privacy techniques such as Bloom filters
and k-anonymity [164].
Generalization-based approaches: A multi-party private comparison and classifi-
cation approach was introduced in [75] to perform secure equi-join of masked records
from multiple k-anonymous databases by using a linkage unit. The database records
are k-anonymised by the database owners and sent to a linkage unit. The linkage
unit then compares and classifies records by applying secure equi-join, which allows
exact matching only.
Another multi-party private comparison and classification approach based on k-
anonymity for categorical values was proposed in [118]. In this approach, a top-
down generalization is performed on the QIDs to provide k-anonymous privacy (as
discussed in Sect. 3.4) and the generalized blocks are then classified into matches
and non-matches using the C4.5 decision tree classifier.
Probabilistic data structure-based approaches: An efficient multi-party private
comparison and classification approach for exact matching of masked records using
Bloom filters was introduced by Lai et al. [104], as illustrated in Fig. 8. Each party
hash-maps their record values into a single Bloom filter and then partitions its Bloom
filter into segments according to the number of parties involved in the linkage. The
segments are exchanged among the parties such that each party receives a corre-
sponding Bloom filter segment from all other parties. The segments received by a
party are combined using a conjunction (logical AND) operation. The resulting con-
juncted Bloom filter segments are then exchanged among the parties to generate the
full conjuncted Bloom filter. Each party compares its Bloom filter of each record with
the final conjuncted Bloom filter. If the membership test of a record’s Bloom filter is
successful then the record is considered to be a match across all databases. Though
the computation cost of this approach is low since the computation is completely
distributed among the parties without a linkage unit and the creation and processing
of Bloom filters are very fast, the approach can only perform exact matching.
Fig. 8 Bloom filter masking-based exact matching approach for MP-PPRL as proposed by Lai et
al. [104] (adapted from [164]): the QID values of each party are hash-mapped into a party-wise
Bloom filter, the Bloom filter segments are combined by a distributed conjunction (logical AND),
and each party then tests its own values against the final conjuncted Bloom filter for exact matches
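The core of this protocol can be illustrated with a small local simulation (ours): all segment conjunctions are computed in one place rather than being distributed among the parties, and the hash functions and Bloom filter size are arbitrary illustrative choices:

```python
import hashlib

BITS = 64   # illustrative Bloom filter length

def positions(value, num_hashes=3):
    """Bit positions that a QID value sets in the Bloom filter."""
    return [int(hashlib.sha256(f'{i}{value}'.encode()).hexdigest(), 16) % BITS
            for i in range(num_hashes)]

def bloom(values):
    bf = 0
    for v in values:
        for pos in positions(v):
            bf |= 1 << pos
    return bf

parties = [{'peter', 'miller', 'robert'}, {'peter', 'anna'}, {'peter', 'robert'}]
conjunction = bloom(parties[0]) & bloom(parties[1]) & bloom(parties[2])

for value in sorted(parties[0]):   # membership test per record
    is_match = all(conjunction >> pos & 1 for pos in positions(value))
    print(value, is_match)   # only 'peter' occurs in all three databases
```

As with any Bloom filter membership test, false positives are possible; the comment on 'peter' assumes no hash collisions for this toy data.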
Fig. 9 Bloom filter masking-based approximate matching approach for MP-PPRL proposed by
Vatsalan and Christen [160] (adapted from [160]): in the illustrated example, the segment-wise
conjunctions yield c1 = 1, c2 = 2, and c3 = 1 common 1-bits, the three Bloom filters contain
x1 = 6, x2 = 6, and x3 = 5 set bits, and thus Dice_sim = 3(1+2+1)/(6+6+5) = 0.706
Another efficient multi-party approach for private comparison and classification of
categorical data was recently proposed [80] using a Count-Min sketch data structure
(as described in Sect. 3.4). Sketches are used to summarize records individually by
each database owner, followed by a secure intersection of these sketches to provide a
global synopsis that contains the common records across parties and their frequencies.
The approach uses homomorphic operations, secure summation, and symmetric noise
addition privacy techniques.
Developing privacy-preserving approximate string comparison functions for mul-
tiple (more than two) values has only recently been considered [160]. This MP-
PPRL approach adapts Lai et al.’s Bloom filter-based exact matching approach [104]
(as described above) for approximate matching to distributively calculate the Dice-
coefficient similarity of a set of Bloom filters from different parties using a secure
summation protocol. This approach is illustrated in Fig. 9. The Dice-coefficient of P
Bloom filters (b1,...,bP) is calculated as:
Dice_sim(b1, ..., bP) = (P × c) / x = (P × (c1 + ... + cP)) / (x1 + ... + xP),   (5)

where ci is the number of common bit positions that are set to 1 in the ith Bloom filter
segment of all P parties, such that c = c1 + ... + cP, and xi is the number of bit positions
set to 1 (1-bits) in bi, such that x = x1 + ... + xP, with 1 ≤ i ≤ P.
Similar to Lai et al.’s approach [104], the Bloom filters are split into segments
such that each party receives a certain segment of the Bloom filters from all other
parties. A logical conjunction is applied to calculate ci individually by each party Pi
(with 1 ≤ i ≤ P); these values are then summed to calculate c using a secure summation
protocol. A secure summation of the xi is also performed to calculate x. These two sums
are then used to calculate the Dice-coefficient similarity of the Bloom filters using
Eq. 5. A limitation of this approach is that it can only be used to link a small number
of databases due to its large number of logical conjunction calculations (even when
a private blocking technique is used).
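A local simulation (ours) of this computation is sketched below; the plain Python sums stand in for the secure summation protocol, and the bit strings are chosen to reproduce the example of Fig. 9 (c1 = 1, c2 = 2, c3 = 1 and x1 = 6, x2 = 6, x3 = 5, giving Dice_sim = 12/17 ≈ 0.706):

```python
def segments(bits, parts):
    """Split a bit string into 'parts' roughly equal segments."""
    step = -(-len(bits) // parts)   # ceiling division
    return [bits[i:i + step] for i in range(0, len(bits), step)]

def dice_sim(blooms):
    p = len(blooms)
    segs = [segments(b, p) for b in blooms]
    # party i holds segment i of every Bloom filter and computes c_i
    c = [sum(all(segs[party][i][j] == '1' for party in range(p))
             for j in range(len(segs[0][i])))
         for i in range(p)]
    x = [b.count('1') for b in blooms]   # x_i; secure-summed in the protocol
    return p * sum(c) / sum(x)           # Eq. 5

b1 = '110001101010000'
b2 = '101001100110000'
b3 = '100101100010000'
print(round(dice_sim([b1, b2, b3]), 3))   # 0.706
```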
Therefore, more work needs to be done on multi-party private comparison and
classification to enable efficient and effective PPRL on multiple large databases, including
sub-set matching (i.e. identifying matching records across sub-sets of parties).
6 Open Challenges
In this section we first describe the various open challenges of PPRL, and then discuss
these challenges in the context of the four V's (volume, variety, velocity, and veracity)
of Big Data.
6.1 Improving Scalability
The growth of Big Data dispersed across multiple sources challenges PPRL in terms
of complexity (volume), which increases exponentially with multiple large databases.
Much research in recent years has focused on improving the scalability of the PPRL
process, both with regard to the sizes of the databases to be linked, as well as with
the number of databases to be linked. While significant progress has been made in
both these directions, further efforts are required to make all aspects of the PPRL
process scalable. Both directions are highly relevant for Big Data applications.
Even small blocks can still lead to a large number of record pair (or set) com-
parisons that are required in the comparison step, especially when databases from
multiple (more than two) sources are to be linked. For each set of blocks across several
parties, potentially all combinations of record sets need to be compared. For a
block that contains B records from each of P parties, B^P comparisons are required.
Crucial are efficient adaptive comparison techniques that stop the comparison of
records across parties once a pair of records has been classified as a non-match
between two parties. For example, assume the record set (rA, rB, rC, rD), where rA
is from party A, rB is from party B, and so on. Once the pair rA and rB has been
compared and classified as a non-match, there is no need to compare all other possible
record pairs (rA with rC, rA with rD, rB with rC, and so on) if the aim of the linkage
is to identify sets of records that match across all parties involved in a PPRL.
A very challenging aspect is the task of identifying sub-sets of records that match
across only a sub-set of parties. An example is to find all patients that have medical
records in the databases of any three out of a group of five hospitals. In this situa-
tion, all potential sub-sets of records need to be compared and classified. This is a
challenging problem with regard to the number of comparisons required and has not
been studied in the literature so far.
6.2 Improving Linkage Quality
The veracity and variety aspects (errors and variations) of Big Data need to be
addressed in PPRL by developing accurate and effective comparison and classifica-
tion techniques for high linkage quality. How to efficiently calculate the similarity of
more than two values using approximate comparison functions in PPRL is an impor-
tant challenge with multi-source linking. Most existing PPRL solutions for multiple
parties only support exact matching [80,104] or they are applicable to QIDs of
only categorical data [75,118]. Thus far only one recent approach supports approx-
imate matching of string data for PPRL on multiple databases [160] (as described in
Sect. 5.2).
In the area of non-PPRL, advanced collective [13] and graph-based [58,74]
classification techniques have been developed in recent times. These techniques are
able to achieve high linkage quality compared to the basic pair-wise comparison
and threshold-based classification approach that is often employed in most PPRL
techniques. Group linkage [123] is the only advanced classification technique that
has so far been considered for PPRL [105].
For classification techniques that require training data (i.e. supervised classifiers),
a major challenge in PPRL is how such training data can be generated. Because of
privacy and confidentiality concerns, in PPRL it is generally not possible to gain
access to the actual sensitive QID values (to decide if they refer to a true match or
a true non-match). The advantage of certain collective and graph-based approaches
[13,74] is that they are unsupervised and therefore do not require training data.
However, their disadvantage is their high computational complexities (quadratic or
even higher) [137]. Investigating and adapting advanced classification techniques
for PPRL will be a crucial step towards making PPRL useful for practical Big Data
applications, where training data are commonly not available, or are expensive to
generate.
6.3 Dynamic Data and Real-Time Matching
All PPRL techniques developed so far, in line with most non-PPRL techniques, only
consider the batch linkage of static databases. However, a major aspect of Big Data
is the dynamic nature of data (velocity) that requires adaptive systems to link data
as they arrive at an organization, ideally in (near) real-time. Limited work has so
far investigated temporal data [33,107] and real-time [32,70,129] matching in
the context of record linkage. Temporal aspects can be considered by adapting the
similarities between records depending upon the time difference between them, while
real-time matching can be achieved using sophisticated adaptive indexing techniques.
Some work has been done on dynamic privacy-preserving data publishing on
the cloud by developing efficient and adaptive QID index-based approaches over
incremental datasets [175,176].
Linking dynamic databases in a PPRL context opens various challenging research
questions. Existing masking (encoding) methods used in PPRL assume static data-
bases that allow parameter settings to be calculated a-priori leading to secure masking
of QID values. For example, Bloom filters should on average have 50% of their bits
set to 1, making frequency attacks more difficult [117]. Such masking might not
stay secure as the characteristics of the data change over time. Dynamic databases
also require novel comparison functions that can adapt to changing data as well as
adaptive masking techniques.
6.4 Improving Security and Privacy
In addition to the four V’s of Big Data, another challenging aspect that needs to
be considered for Big Data applications is security and privacy. As we discussed
in Sect. 3.2, most work in PPRL assumes the honest-but-curious (HBC) adversary
model [65,111]. Most PPRL protocols also assume that the parties do not collude
with each other (i.e. a sub-set of two or more parties do not collaborate with the
aim to learn sensitive information of another party) [111]. However, in a commercial
environment and in PPRL scenarios where many parties are involved, such as is
likely in Big Data applications, collusion is a real possibility that needs to be pre-
vented. Only few PPRL techniques consider the malicious adversary model [164].
The techniques developed based on this security model commonly have high compu-
tational complexities and are therefore currently not practical for the linkage of large
databases. Therefore, because the HBC model might not be strong enough while
the malicious model is computationally too expensive, novel security models that
lie between those two need to be investigated for PPRL. Two of these are the covert
adversary model [4] and accountable computing [71], which have been discussed in
Sect. 3.2. Research directions are required to develop new protocols that are practical
and at the same time more secure than protocols based on the HBC model.
With regard to privacy, most PPRL techniques are known to leak some information
during the exchange of data between the parties (such as the number and sizes of
blocks, or the similarities between compared records). How sensitive such revealed
information is for a certain dataset heavily depends upon the parameter settings used
by a protocol. Sophisticated attack methods [101] have been developed that exploit
the subtle pieces of information revealed by certain PPRL protocols to iteratively
gather information about sensitive values. Therefore, there is a need to harden existing
PPRL techniques to ensure they are not vulnerable to such attacks. Preserving privacy
of individual entities is more challenging with multi-party PPRL due to the increasing
risk of collusion between a sub-set of parties which aim to learn about another (sub-set
of) party’s private data. Distributing computations among pairs or groups of parties
can reduce the likelihood of collusion between parties if individual pairs or groups
can use different secret keys (known only to them) for masking their values.
Most PPRL techniques have mainly been focusing on the privacy of the individual
records that are to be linked [165]. However, besides individual record privacy, the
privacy of a group of individuals also needs to be considered. Often the outcomes of
a PPRL project are sets of linked records that represent people with certain charac-
teristics (such as certain illnesses, or particular financial circumstances). While the
names, addresses and other personal details of these people are not revealed during or
after the PPRL process, their overall characteristics as a group could potentially lead
to the discrimination of individuals in this group if these characteristics are being
revealed. The research areas of privacy-preserving data publishing [59] and statistical
confidentiality [45] have been addressing these issues from different directions.
PPRL is only one component in the management and analysis of sensitive, person-
related information by linking different datasets in a privacy-preserving manner.
However, achieving an effective overall privacy preservation needs a comprehensive
strategy regarding the whole data life cycle including collection, management, pub-
lishing, exchange and analysis of data to be protected (‘privacy-by-design’) [22].
Hence, it is necessary to better understand the role of PPRL in the life cycle for
sensitive data to ensure that it can be applied and that the match results are both
useful and privacy-preserving.
In research, the different technical aspects to preserve privacy have partially been
addressed by different communities with little interaction. For example, there is a
large body of research on privacy-preserving data publishing [59] and on privacy-
preserving data mining [109,156] that have been largely decoupled from the research
on PPRL. It is well known that data analysis may identify individuals despite the
masking of QID values [152]. Hence, there is a similar risk that the combined infor-
mation of matched records together with some background information could lead
to the identification of individuals (known as re-identification). Such risks must be
evaluated and addressed within a comprehensive privacy strategy including a closely
aligned PPRL and privacy-preserving data analysis/mining approach.
6.5 Evaluation, Frameworks, and Benchmarks
How to assess the quality (how many classified matches are true matches) and com-
pleteness (how many true matches have been classified as matches) of the records
linked in a PPRL project is very challenging because it is generally not possible
to inspect linked records due to privacy concerns. Manual assessment of individual
records would reveal sensitive information which is in contradiction to the objective
of PPRL. Not knowing how accurate and complete linked data are is however a major
issue that will render any PPRL protocol impractical in applications where linkage
completeness and quality are crucial, as is the case in many Big Data applications
such as in the health or security domains.
Recent initial work has proposed ideas and concepts for interactive PPRL [100]
where parts of sensitive values are revealed for manual assessment. How to actually
implement such approaches in real applications, while ensuring the revealed infor-
mation is limited to a certain level of detail (for example providing k-anonymous
privacy for a certain value of k > 1 [152]) is an open research question that must be
solved. Interactive manual evaluation might also not be feasible in Big Data appli-
cations where the size and dynamic nature of data, as well as real-time processing
requirements, prohibit any manual inspection.
With regard to evaluating the privacy protection that a given PPRL technique
provides, unlike for measuring linkage quality and completeness (where standard
measurements such as runtime, reduction ratio, pairs completeness, pairs quality,
precision, recall, or accuracy are available [28]), there are currently no standard
measurements for assessing privacy in PPRL. Different measurements have been
proposed and used [46,164,165], making the comparison of different PPRL tech-
niques difficult. How to assess linkage quality and completeness, as well as privacy,
are must-solve problems as otherwise it will not be possible to evaluate the efficiency,
effectiveness, and privacy protection of PPRL techniques in real-world applications,
leaving these techniques non-practical.
An important direction of future work for PPRL is the development of frameworks
that allow the experimental comparison of different PPRL techniques with regard
to their scalability, linkage quality, and privacy preservation. No such framework
currently exists. Ideally, such frameworks allow researchers to easily ‘plug-in’ their
own algorithms such that over time a collection of PPRL algorithms is compiled that
can be tested and evaluated by researchers, as well as by practitioners to allow them
to identify the best technique to use for their application scenario.
An issue related to frameworks is the availability of publicly available benchmark
datasets for PPRL. While this is not a challenge limited to PPRL but to record linkage
research in general [28,95], it is particularly prominent for PPRL as it deals with
sensitive and confidential data. While for record linkage techniques publicly available
data from bibliographic or consumer product databases might be used [95], such data
are less useful for PPRL research as they have different characteristics compared to
personal data. The nature of the datasets to be linked using PPRL techniques is
obviously in strong contradiction to them being made public. Ideally researchers
working in PPRL are able to collaborate with practitioners that do have access to
real sensitive and confidential databases to allow them to evaluate their techniques
on such data.
A possible alternative to using benchmark datasets is the use of synthetic data that
are generated based on the characteristics of real data using data generators [34,153].
Such generators must be able to generate data with similar distribution of values, vari-
ations, and errors as would be expected in real datasets from the same domain. Several
such data generators have been developed and are used by researchers working in
PPRL as well as record linkage in general.
6.6 Discussion
As we have discussed in this section, there are various challenges that need to be
addressed in order to make PPRL practical for applications in a variety of domains.
Some of these challenges are general and do not just affect PPRL for Big Data; others
are specific to certain types of applications, including those in the Big Data space.
The challenge of scalability of PPRL towards very large databases is highly rel-
evant to the volume of Big Data, while the challenge of linkage quality of PPRL
is highly relevant to the veracity and variety of Big Data. The dynamic nature of
data in many Big Data applications, and the requirement of being able to link data
in real-time, are challenging all aspects of PPRL, as well as record linkage in gen-
eral [129]. This challenge corresponds to the velocity of Big Data and it requires the
development of novel techniques that are adaptive to changing data characteristics,
and that are highly efficient with regard to fast linking of streams of query records.
While the volume,variety, and veracity aspects of Big Data have been studied for
PPRL to some extent, the velocity aspect has so far not been addressed in a PPRL
context.
Making PPRL more secure and more private is challenged by all four V’s of Big
Data. Larger data volume likely means that only encoding techniques that require
little computational efforts per record can be employed, while dynamic data (velocity)
means such techniques have to be adaptable to changing data characteristics. Variety
means PPRL techniques have to be made more secure and private for various types
of data, while veracity requires them to also take data uncertainties into account. The
challenge of integrating PPRL into an overall privacy-preserving approach has also
not seen any work so far. All four V’s of Big Data will affect the overall efficiency and
effectiveness of systems that enable the management and analysis of sensitive and
confidential information in a privacy-preserving manner. The more basic challenges
of improving scalability, linkage quality, privacy and evaluation need to be solved first
before this more complex challenge of an overall privacy-preserving system can be
addressed.
The final challenge of evaluation is affected by all aspects of Big Data. Improved
evaluation of PPRL systems requires that databases that are large, heterogeneous,
dynamic, and that contain uncertain data, can be handled and evaluated efficiently
and accurately. So far no research in PPRL has investigated evaluation specifically
for Big Data. While the lack of general benchmarks and frameworks is already a gap
in PPRL and record linkage research in general, Big Data will make this challenge
even more pronounced. Compared to frameworks that can handle small and medium
sized static datasets only, it is even more difficult to develop frameworks that enable
privacy-preserving linking of very large and dynamic databases, as is making such
datasets publicly available. No work addressing this challenge in the context of Big
Data has been published.
7 Conclusions
Privacy-preserving record linkage (PPRL) is an emerging research field that is being
required by many different applications to enable effective and efficient linkage of
databases across different organizations without compromising privacy and confiden-
tiality of the entities in these databases. In the Big Data era, tremendous opportunities
can be realized by linking data at the cost of additional challenges. In this chapter, we
have provided background material required to understand the applications, process,
and challenges of PPRL, and we have reviewed existing PPRL approaches to under-
stand the literature. Based on the analysis of existing techniques, we have discussed
several interesting and challenging directions for future work in PPRL for Big Data.
With the increasing trend of Big Data in organizations, more research is required
towards the development of techniques that allow for multiple large databases to be
linked in privacy-preserving, effective, and efficient ways, thereby facilitating novel
ways of data analysis and mining that currently are not feasible due to scalability,
quality, and privacy-preserving challenges.
Acknowledgements This work was partially funded by the Australian Research Council under
Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universi-
ties Australia (UA) under the Joint Research Co-operation Scheme, and also funded by the German
Federal Ministry of Education and Research within the project Competence Center for Scalable
Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
References
1. R. Agrawal, A. Evfimievski, R. Srikant, Information sharing across private databases, in ACM
SIGMOD (2003), pp. 86–97
2. A. Arasu, V. Ganti, R. Kaushik, Efficient exact set-similarity joins, in PVLDB (2006), pp.
918–929
3. A. Arasu, M. Götz, R. Kaushik, On active learning of record matching packages, in ACM
SIGMOD (2010), pp. 783–794
4. Y. Aumann, Y. Lindell, Security against covert adversaries: efficient protocols for realistic
adversaries. J. Cryptol. 23(2), 281–343 (2010)
5. T. Bachteler, J. Reiher, R. Schnell, Similarity filtering with multibit trees for record linkage.
Technical Report WP-GRLC-2013-01, German Record Linkage Center (2013)
6. D. Barone, A. Maurino, F. Stella, C. Batini, A privacy-preserving framework for accuracy
and completeness quality assessment, in Emerging Paradigms in Informatics, Systems and
Communication (2009), pp. 83–87
7. J.E. Barros, J.C. French, W.N. Martin, P.M. Kelly, T.M. Cannon, Using the triangle inequality
to reduce the number of comparisons required for similarity-based retrieval, in Electronic
Imaging Science and Technology (1996), pp. 392–403
8. C. Batini, M. Scannapieca, Data quality: Concepts, Methodologies And Techniques. Data-
Centric Systems and Applications (Springer, Berlin, 2006)
9. R. Baxter, P. Christen, T. Churches, A comparison of fast blocking methods for record linkage,
in SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation (2003),
pp. 25–27
10. R.J. Bayardo, Y. Ma, R. Srikant, Scaling Up All Pairs Similarity Search, in WWW (2007), pp.
131–140
11. K. Bellare, S. Iyengar, A.G. Parameswaran, V. Rastogi, Active sampling for entity matching,
in ACM SIGKDD (2012), pp. 1131–1139
12. A. Berman, L.G. Shapiro, Selecting good keys for triangle-inequality-based pruning algo-
rithms, in IEEE Workshop on Content-Based Access of Image and Video Database (1998),
pp. 12–19
13. I. Bhattacharya, L. Getoor, Collective entity resolution in relational data. ACM TKDD 1(1),
1–35 (2007)
14. M. Bilenko, R.J. Mooney, Adaptive duplicate detection using learnable string similarity mea-
sures, in ACM SIGKDD (2003), pp. 39–48
15. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7),
422–426 (1970)
16. L. Bonomi, L. Xiong, R. Chen, B. Fung, Frequent grams based embedding for privacy pre-
serving record linkage, in ACM CIKM (2012), pp. 1597–1601
17. H. Bouzelat, C. Quantin, L. Dusserre, Extraction and anonymity protocol of medical file, in
AMIA Fall Symposium (1996), pp. 323–327
18. A.Z. Broder, On the resemblance and containment of documents, in Compression and Com-
plexity of Sequences. IEEE (1997), pp. 21–29
19. A. Broder, M. Mitzenmacher, A. Mitzenmacher, Network applications of Bloom filters: a
survey. Internet Math. 1(4), 485–509 (2004)
20. E. Brook, D. Rosman, C. Holman, Public good through data linkage: measuring research
outputs from the Western Australian data linkage system. Aust. NZ J. Public Health 32,
19–23 (2008)
21. R. Canetti, Security and composition of multiparty cryptographic protocols. J. Cryptol. 13(1),
143–202 (2000)
22. A. Cavoukian, J. Jonas, Privacy by design in the age of Big Data. Technical report,
Information and Privacy Commissioner, Ontario (2012)
23. P. Christen, A comparison of personal name matching: techniques and practical issues, in
IEEE ICDM Workshop on Mining Complex Data (2006), pp. 290–294
24. P. Christen, Privacy-preserving data linkage and geocoding: current approaches and research
directions, in IEEE ICDM Workshop on Privacy Aspects of Data Mining (2006), pp. 497–501
25. P. Christen, Automatic record linkage using seeded nearest neighbour and support vector
machine classification, in ACM SIGKDD (2008), pp. 151–159
26. P. Christen, Febrl: an open source data cleaning, deduplication and record linkage system
with a graphical user interface, in ACM SIGKDD (2008), pp. 1065–1068
27. P. Christen, Geocode matching and privacy preservation, in Workshop on Privacy, Security,
and Trust in KDD (Springer, Berlin, 2009), pp. 7–24
28. P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution,
and Duplicate Detection (Springer, Berlin, 2012)
29. P. Christen, A survey of indexing techniques for scalable record linkage and deduplication.
IEEE TKDE 24(9), 1537–1555 (2012)
30. P. Christen, T. Churches, M. Hegland, Febrl – a parallel open source data linkage system, in
Springer PAKDD (2004), pp. 638–647
31. P. Christen, K. Goiser, Quality and complexity measures for data linkage and deduplication,
in Quality Measures in Data Mining, vol. 43. Studies in Computational Intelligence (Springer,
Berlin, 2007), pp. 127–151
32. P. Christen, R. Gayler, D. Hawking, Similarity-aware indexing for real-time entity resolution,
in ACM CIKM (2009), pp. 1565–1568
33. P. Christen, R.W. Gayler, Adaptive temporal entity resolution on dynamic databases, in
PAKDD (2013), pp. 558–569
34. P. Christen, D. Vatsalan, Flexible and extensible generation and corruption of personal data,
in ACM CIKM (2013), pp. 1165–1168
35. T. Churches, P. Christen, Some methods for blindfolded record linkage. BioMed Cent. Med.
Inf. Decision Mak. 4(9) (2004)
36. T. Churches, P. Christen, K. Lim, J.X. Zhu, Preparation of name and address data for record
linkage using hidden Markov models. BioMed Cent. Med. Inf. Decision Mak. 2(9) (2002)
37. D.E. Clark, Practical introduction to record linkage for injury research. Inj. Prev. 10, 186–191
(2004)
38. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M. Zhu, Tools for privacy-preserving distributed
data mining. SIGKDD Explor. 4(2), 28–34 (2002)
39. W.W. Cohen, Data integration using similarity joins and a word-based information represen-
tation language. ACM TOIS 18(3), 288–321 (2000)
40. W.W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for
data integration, in ACM SIGKDD (2002), pp. 475–480
41. G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch
and its applications. J. Algorithms 55(1), 58–75 (2005)
42. G. Dal Bianco, R. Galante, C.A. Heuser, A fast approach for parallel deduplication on multi-
core processors, in ACM Symposium on Applied Computing (2011), pp. 1027–1032
43. D. Dey, V. Mookerjee, D. Liu, Efficient techniques for online record linkage. IEEE TKDE
23(3), 373–387 (2010)
44. W. Du, M. Atallah, Protocols for secure remote database access with approximate matching,
in ACM WSPEC (Springer, Berlin, 2000), pp. 87–111
45. G.T. Duncan, M. Elliot, J.-J. Salazar-González, Statistical Confidentiality: Principles and
Practice (Springer, New York, 2011)
46. E. Durham, A framework for accurate, efficient private record linkage. Ph.D. thesis, Faculty
of the Graduate School of Vanderbilt University, Nashville, TN, 2012
47. E. Durham, Y. Xue, M. Kantarcioglu, B. Malin, Private medical record linkage with approx-
imate matching, in AMIA Annual Symposium (2010), pp. 182–186
48. E.A. Durham, C. Toth, M. Kuzu, M. Kantarcioglu, Y. Xue, B. Malin, Composite Bloom filters
for secure record linkage. IEEE TKDE 26(12), 2956–2968 (2013)
49. L. Dusserre, C. Quantin, H. Bouzelat, A one way public key cryptosystem for the linkage of
nominal files in epidemiological studies. Medinfo 8, 644–647 (1995)
50. C. Dwork, Differential privacy, in ICALP (2006), pp. 1–12
51. M.G. Elfeky, V.S. Verykios, A.K. Elmagarmid, TAILOR: a record linkage toolbox, in IEEE
ICDE (2002), pp. 17–28
52. A. Elmagarmid, P. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey. IEEE TKDE
19(1), 1–16 (2007)
53. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery
and Data Mining (The MIT Press, Cambridge, 1996)
54. I.P. Fellegi, A.B. Sunter, A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210
(1969)
55. S.E. Fienberg, Confidentiality and disclosure limitation. Encycl. Soc. Meas. 1, 463–469 (2005)
56. B. Forchhammer, T. Papenbrock, T. Stening, S. Viehmeier, U. Draisbach, F. Naumann, Dupli-
cate detection on GPUs, in BTW (2013), pp. 165–184
57. M. Freedman, Y. Ishai, B. Pinkas, O. Reingold, Keyword search and oblivious pseudorandom
functions, in Theory of Cryptography (2005), pp. 303–324
58. Z. Fu, J. Zhou, P. Christen, M. Boot, Multiple instance learning for group record linkage, in
PAKDD, Springer LNAI (2012), pp. 171–182
59. B. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey of recent
developments. ACM Comput. Surv. 42(4), 14 (2010)
60. S.R. Ganta, S.P. Kasiviswanathan, A. Smith, Composition attacks and auxiliary information
in data privacy, in ACM SIGKDD (2008), pp. 265–273
61. A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in VLDB
(1999), pp. 518–529
62. O. Goldreich, Foundations of Cryptography: Basic Applications, vol. 2. (Cambridge Univer-
sity Press, Cambridge, 2004)
63. L. Gu, R. Baxter, Decision models for record linkage, in Selected Papers from AusDM, LNCS,
vol. 3755 (Springer, Berlin, 2006), pp. 146–160
64. M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava, Fast indexes and algorithms for
set similarity selection queries, in IEEE ICDE (2008), pp. 267–276
65. R. Hall, S. Fienberg, Privacy-preserving record linkage, in PSD (2010), pp. 269–283
66. M. Herschel, F. Naumann, S. Szott, M. Taubert, Scalable iterative graph duplicate detection.
IEEE TKDE 24(11), 2094–2108 (2012)
67. A. Inan, M. Kantarcioglu, E. Bertino, M. Scannapieco, A hybrid approach to private record
linkage, in IEEE ICDE (2008), pp. 496–505
68. A. Inan, M. Kantarcioglu, G. Ghinita, E. Bertino, Private record matching using differential
privacy, in EDBT (2010), pp. 123–134
69. P. Indyk, R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimen-
sionality, in ACM Symposium on the Theory of Computing (1998), pp. 604–613
70. E. Ioannou, W. Nejdl, C. Niederée, Y. Velegrakis, On-the-fly entity-aware query processing
in the presence of linkage. PVLDB 3(1–2), 429–438 (2010)
71. W. Jiang, C. Clifton, AC-framework for privacy-preserving collaboration, in SIAM SDM
(2007), pp. 47–56
72. W. Jiang, C. Clifton, M. Kantarcıoğlu, Transforming semi-honest protocols to ensure account-
ability. Elsevier DKE 65(1), 57–74 (2008)
73. J. Jonas, J. Harper, Effective counterterrorism and the limited role of predictive data mining.
Policy Anal. 584, 1–12 (2006)
74. D. Kalashnikov, S. Mehrotra, Domain-independent data cleaning via analysis of entity-
relationship graph. ACM TODS 31(2), 716–767 (2006)
75. M. Kantarcioglu, W. Jiang, B. Malin, A privacy-preserving framework for integrating person-
specific databases, in PSD (2008), pp. 298–314
76. A. Karakasidis, V.S. Verykios, Secure blocking + secure matching = secure record linkage.
JCSE 5, 223–235 (2011)
77. A. Karakasidis, V.S. Verykios, Reference table based k-anonymous private blocking, in ACM
SAC (2012), pp. 859–864
78. A. Karakasidis, V.S. Verykios, A sorted neighborhood approach to multidimensional privacy
preserving blocking, in IEEE ICDMW (2012), pp. 937–944
79. A. Karakasidis, V.S. Verykios, P. Christen, Fake injection strategies for private phonetic match-
ing. DPM Springer 7122, 9–24 (2012)
80. D. Karapiperis, D. Vatsalan, V.S. Verykios, P. Christen, Large-scale multi-party counting set
intersection using a space efficient global synopsis, in DASFAA (2015), pp. 329–345
81. D. Karapiperis, D. Vatsalan, V.S. Verykios, P. Christen, Efficient record linkage using a com-
pact hamming space, in EDBT (2016), pp. 209–220
82. D. Karapiperis, V.S. Verykios, A distributed framework for scaling up LSH-based computa-
tions in privacy preserving record linkage, in ACM BCI (2013), pp. 102–109
83. D. Karapiperis, V.S. Verykios, A distributed near-optimal LSH-based framework for privacy-
preserving record linkage. ComSIS 11(2), 745–763 (2014)
84. D. Karapiperis, V.S. Verykios, An LSH-based blocking approach with a homomorphic match-
ing technique for privacy-preserving record linkage. IEEE TKDE 27(4), 909–921 (2015)
85. D. Karapiperis, V.S. Verykios, A fast and efficient hamming LSH-based scheme for accurate
linkage, in Springer KAIS (2016), pp. 1–24
86. H. Kargupta, S. Datta, Q. Wang, K. Sivakumar, On the privacy-preserving properties of random
data perturbation techniques, in IEEE ICDM (2003), p. 99
87. H. Kargupta, S. Datta, Q. Wang, K. Sivakumar, Random-data perturbation techniques and
privacy-preserving data mining. Springer KAIS 7(4), 387–414 (2005)
88. C.W. Kelman, J. Bass, D. Holman, Research use of linked health data - a best practice protocol.
Aust. NZ J. Public Health 26, 251–255 (2002)
89. H. Kim, D. Lee, HARRA: fast iterative hashed record linkage for large-scale data collections, in
EDBT (2010), pp. 525–536
90. H.-S. Kim, D. Lee, Parallel linkage, in ACM CIKM (2007), pp. 283–292
91. T. Kirsten, L. Kolb, M. Hartung, A. Groß, H. Köpcke, E. Rahm, Data partitioning for parallel
entity matching, in QDB (2010)
92. L. Kissner, D. Song, Private and threshold set-intersection. Technical Report, Carnegie
Mellon University (2004)
93. L. Kolb, A. Thor, E. Rahm, Dedoop: efficient deduplication with Hadoop. PVLDB 5(12),
1878–1881 (2012)
94. L. Kolb, A. Thor, E. Rahm, Load balancing for MapReduce-based entity resolution, in IEEE
ICDE (2012), pp. 618–629
95. H. Köpcke, E. Rahm, Frameworks for entity matching: a comparison. Elsevier DKE 69(2),
197–210 (2010)
96. H. Köpcke, A. Thor, E. Rahm, Evaluation of entity resolution approaches on real-world match
problems. PVLDB 3(1), 484–493 (2010)
97. H. Krawczyk, M. Bellare, R. Canetti, HMAC: keyed-hashing for message authentication, in
Internet RFCs (1997)
98. T.G. Kristensen, J. Nielsen, C.N. Pedersen, A tree-based method for the rapid screening of
chemical fingerprints. Algorithms Mol. Biol. 5(1), 9 (2010)
99. H. Kum, A. Krishnamurthy, A. Machanavajjhala, S. Ahalt, Population informatics: tapping
the social genome to advance society: a vision for putting “big data” to work for population
informatics. Computer (2013)
100. H.-C. Kum, A. Krishnamurthy, A. Machanavajjhala, M.K. Reiter, S. Ahalt, Privacy preserving
interactive record linkage. JAMIA 21(2), 212–220 (2014)
101. M. Kuzu, M. Kantarcioglu, E. Durham, B. Malin, A constraint satisfaction cryptanalysis of
Bloom filters in private record linkage. PETS Springer LNCS 6794, 226–245 (2011)
102. M. Kuzu, M. Kantarcioglu, E.A. Durham, C. Toth, B. Malin, A practical approach to achieve
private medical record linkage in light of public resources. JAMIA 20(2), 285–292 (2013)
103. M. Kuzu, M. Kantarcioglu, A. Inan, E. Bertino, E. Durham, B. Malin, Efficient privacy-aware
record integration, in ACM EDBT (2013), pp. 167–178
104. P. Lai, S. Yiu, K. Chow, C. Chong, L. Hui, An efficient Bloom filter based solution for
multiparty private matching, in SAM (2006)
105. F. Li, Y. Chen, B. Luo, D. Lee, P. Liu, Privacy preserving group linkage, in Scientific and
Statistical Database Management (Springer, Berlin, 2011), pp. 432–450
106. N. Li, T. Li, S. Venkatasubramanian, T-closeness: privacy beyond k-anonymity and l-diversity,
in IEEE ICDE (2007), pp. 106–115
107. P. Li, X. Dong, A. Maurino, D. Srivastava, Linking temporal records. PVLDB 4(11), 956–967
(2011)
108. Z. Lin, M. Hewett, R.B. Altman, Using binning to maintain confidentiality of medical data,
in AMIA Symposium (2002), p. 454
109. Y. Lindell, B. Pinkas, Privacy preserving data mining, in CRYPTO (Springer, Berlin, 2000),
pp. 36–54
110. Y. Lindell, B. Pinkas, An efficient protocol for secure two-party computation in the presence
of malicious adversaries, in EUROCRYPT (2007), pp. 52–78
111. Y. Lindell, B. Pinkas, Secure multiparty computation for privacy-preserving data mining. JPC
1(1), 59–98 (2009)
112. H. Liu, H. Wang, Y. Chen, Ensuring data storage security against frequency-based attacks in
wireless networks, in DCOSS, Springer LNCS, vol. 6131 (2010), pp. 201–215
113. H. Lu, M.-C. Shan, K.-L. Tan, Optimization of multi-way join queries for parallel execution,
in VLDB (1991), pp. 549–560
114. M. Luby, C. Rackoff, How to construct pseudo-random permutations from pseudo-random
functions, in CRYPTO, vol. 85 (1986), p. 447
115. A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: privacy beyond
k-anonymity. ACM TKDD 1(1), 3 (2007)
116. B.A. Malin, K. El Emam, C.M. O’Keefe, Biomedical data privacy: problems, perspectives,
and recent advances. JAMIA 20(1), 2–6 (2013)
117. M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Prob-
abilistic Analysis (Cambridge University Press, Cambridge, 2005)
118. N. Mohammed, B. Fung, M. Debbabi, Anonymity meets game theory: secure data integration
with malicious participants. VLDB J. 20(4), 567–588 (2011)
119. M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, E. Rahm, A survey of current link discovery
frameworks. Semantic Web Journal (2016)
120. A.N. Ngomo, L. Kolb, N. Heino, M. Hartung, S. Auer, E. Rahm, When to reach for the cloud:
using parallel hardware for link discovery, in ESWC (2013), pp. 275–289
121. Office for National Statistics, Beyond 2011: matching anonymous data (2013)
122. C. O’Keefe, M. Yung, L. Gu, R. Baxter, Privacy-preserving data linkage protocols, in ACM
WPES (2004), pp. 94–102
123. B. On, N. Koudas, D. Lee, D. Srivastava, Group linkage, in IEEE ICDE (2007), pp. 496–505
124. C. Pang, L. Gu, D. Hansen, A. Maeder, Privacy-preserving fuzzy matching using a pub-
lic reference table, in Intelligent Patient Management, vol. 189. Studies in Computational
Intelligence (Springer, Berlin, 2009), pp. 71–89
125. C. Phua, K. Smith-Miles, V. Lee, R. Gayler, Resilient identity crime detection. IEEE TKDE
24(3), 533–546 (2012)
126. C. Quantin, H. Bouzelat, L. Dusserre, Irreversible encryption method by generation of poly-
nomials. Med. Inf. Internet Med. 21(2), 113–121 (1996)
127. C. Quantin, H. Bouzelat, F. Allaert, A. Benhamiche, J. Faivre, L. Dusserre, How to ensure
data security of an epidemiological follow-up: quality assessment of an anonymous record
linkage procedure. IJMI 49(1), 117–122 (1998)
128. E. Rahm, H.H. Do, Data cleaning: problems and current approaches. IEEE Data Eng. Bull.
23(4), 3–13 (2000)
129. B. Ramadan, P. Christen, H. Liang, R.W. Gayler, Dynamic sorted neighborhood indexing for
real-time entity resolution. ACM JDIQ 6(4), 15 (2015)
130. T. Ranbaduge, P. Christen, D. Vatsalan, Tree based scalable indexing for multi-party privacy-
preserving record linkage, in AusDM (2014)
131. T. Ranbaduge, D. Vatsalan, P. Christen, Clustering-based scalable indexing for multi-party
privacy-preserving record linkage, in Springer PAKDD (2015), pp. 549–561
132. T. Ranbaduge, D. Vatsalan, P. Christen, Merlin–a tool for multi-party privacy-preserving
record linkage, in IEEE ICDMW (2015), pp. 1640–1643
133. T. Ranbaduge, D. Vatsalan, P. Christen, Hashing-based distributed multi-party blocking for
privacy-preserving record linkage, in Springer PAKDD (2016), pp. 415–427
134. T. Ranbaduge, D. Vatsalan, S. Randall, P. Christen, Evaluation of advanced techniques for
multi-party privacy-preserving record linkage on real-world health databases, in IPDLN
(2016)
135. S.M. Randall, A.M. Ferrante, J.H. Boyd, J.B. Semmens, Privacy-preserving record linkage
on large real world datasets. Elsevier JBI 50, 205–212 (2014)
136. S.M. Randall, A.M. Ferrante, J.H. Boyd, A.P. Brown, J.B. Semmens, Limited privacy pro-
tection and poor sensitivity: is it time to move on from the statistical linkage key-581? Health
Inf. Manag. J. 37, 60–62 (2016)
137. V. Rastogi, N. Dalvi, M. Garofalakis, Large-scale collective entity matching. PVLDB 4,
208–218 (2011)
138. C. Rong, W. Lu, X. Wang, X. Du, Y. Chen, A.K.H. Tung, Efficient and scalable processing
of string similarity join. IEEE TKDE 25(10), 2217–2230 (2013)
139. M. Roughan, Y. Zhang, Secure distributed data-mining and its application to large-scale
network measurements. ACM SIGCOMM Comput. Commun. Rev. 36(1), 7–14 (2006)
140. T. Ryan, D. Gibson, B. Holmes, A national minimum data set for home and community care.
Australian Institute of Health and Welfare (1999)
141. M. Scannapieco, I. Figotin, E. Bertino, A. Elmagarmid, Privacy preserving schema and data
matching, in ACM SIGMOD (2007), pp. 653–664
142. D.A. Schneider, D.J. DeWitt, Tradeoffs in processing complex join queries via hashing in
multiprocessor database machines, in VLDB (1990), pp. 469–480
143. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd edn.
(Wiley, New York, 1996)
144. R. Schnell, Privacy-preserving record linkage and privacy-preserving blocking for large files
with cryptographic keys using multibit trees, in JSM (2013), pp. 187–194
145. R. Schnell, An efficient privacy-preserving record linkage technique for administrative data
and censuses. Stat. J. IAOS 30(3), 263–270 (2014)
146. R. Schnell, T. Bachteler, S. Bender, A toolbox for record linkage. Aust. J. Stat. 33(1–2),
125–133 (2004)
147. R. Schnell, T. Bachteler, J. Reiher, Privacy-preserving record linkage using Bloom filters.
BMC Med. Inf. Decision Mak. 9(1), 41 (2009)
148. R. Schnell, T. Bachteler, J. Reiher, A novel error-tolerant anonymous linking code, in German
Record Linkage Center, WP-GRLC-2011-02 (2011)
149. Z. Sehili, E. Rahm, Speeding up privacy preserving record linkage for metric space similarity
measures, in Datenbank-Spektrum (2016), pp. 1–10
150. Z. Sehili, L. Kolb, C. Borgs, R. Schnell, E. Rahm, Privacy preserving record linkage with
PPJoin, in BTW Conference (2015)
151. D. Song, D. Wagner, A. Perrig, Practical techniques for searches on encrypted data, in IEEE
Symposium on Security and Privacy (2000), pp. 44–55
152. L. Sweeney, K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness
Knowl.-Based Syst. 10(5), 557–570 (2002)
153. K.-N. Tran, D. Vatsalan, P. Christen, GeCo: an online personal data generator and corruptor,
in ACM CIKM (2013), pp. 2473–2476
154. S. Trepetin, Privacy-preserving string comparisons in record linkage systems: a review. Inf.
Secur. J.: A Global Perspect. 17(5), 253–266 (2008)
155. E. Turgay, T. Pedersen, Y. Saygın, E. Savaş, A. Levi, Disclosure risks of distance preserving
data transformations, in Springer SSDBM (2008), pp. 79–94
156. J. Vaidya, Y. Zhu, C.W. Clifton, Privacy Preserving Data Mining, vol. 19. Advances in Infor-
mation Security (Springer, Berlin, 2006)
157. E. Van Eycken, K. Haustermans, F. Buntinx et al., Evaluation of the encryption procedure and
record linkage in the Belgian national cancer registry. Archiv. Public Health 58(6), 281–294
(2000)
158. D. Vatsalan, P. Christen, An iterative two-party protocol for scalable privacy-preserving record
linkage, in AusDM, CRPIT (2012), pp. 127–138
159. D. Vatsalan, P. Christen, Sorted nearest neighborhood clustering for efficient private blocking,
in Springer PAKDD, vol. 7819 (2013), pp. 341–352
160. D. Vatsalan, P. Christen, Scalable privacy-preserving record linkage for multiple databases,
in ACM CIKM (2014), pp. 1795–1798
161. D. Vatsalan, P. Christen, Privacy-preserving matching of similar patients. Elsevier JBI 59,
285–298 (2016)
162. D. Vatsalan, P. Christen, V.S. Verykios, An efficient two-party protocol for approximate match-
ing in private record linkage, in AusDM (2011), pp. 125–136
163. D. Vatsalan, P. Christen, V.S. Verykios, Efficient two-party private blocking based on sorted
nearest neighborhood clustering, in ACM CIKM (2013), pp. 1949–1958
164. D. Vatsalan, P. Christen, V.S. Verykios, A taxonomy of privacy-preserving record linkage
techniques. Elsevier JIS 38(6), 946–969 (2013)
165. D. Vatsalan, P. Christen, C.M. O’Keefe, V.S. Verykios, An evaluation framework for privacy-
preserving record linkage. JPC 6(1), 35–75 (2014)
166. R. Vernica, M.J. Carey, C. Li, Efficient parallel set-similarity joins using MapReduce, in ACM
SIGMOD (2010), pp. 495–506
167. V.S. Verykios, A. Karakasidis, V. Mitrogiannis, Privacy preserving record linkage approaches.
IJDMMM 1(2), 206–221 (2009)
168. G. Wang, H. Chen, H. Atabakhsh, Automatically detecting deceptive criminal identities.
Commun. ACM 47(3), 70–76 (2004)
169. Q. Wang, D. Vatsalan, P. Christen, Efficient interactive training selection for large-scale entity
resolution, in PAKDD (2015), pp. 562–573
170. Z. Wen, C. Dong, Efficient protocols for private record linkage, in ACM Symposium on Applied
Computing (2014), pp. 1688–1694
171. W.E. Winkler, Methods for evaluating and creating data quality. Elsevier JIS 29(7), 531–550
(2004)
172. C. Xiao, W. Wang, X. Lin, J.X. Yu, Efficient similarity joins for near duplicate detection, in
WWW (2008), pp. 131–140
173. M. Yakout, M. Atallah, A. Elmagarmid, Efficient private record linkage, in IEEE ICDE (2009),
pp. 1283–1286
174. P. Zezula, G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach,
vol. 32 (Springer, Berlin, 2006)
175. X. Zhang, C. Liu, S. Nepal, J. Chen, An efficient quasi-identifier index based approach for
privacy preservation over incremental data sets on cloud. J. Comput. Syst. Sci. 79(5), 542–555
(2013)
176. X. Zhang, C. Liu, S. Nepal, S. Pandey, J. Chen, A privacy leakage upper bound constraint-
based approach for cost-effective privacy preserving of intermediate data sets in cloud. IEEE
TPDS 24(6), 1192–1202 (2013)