Privacy-Preserving Record Linkage
for Big Data: Current Approaches
and Research Challenges
Dinusha Vatsalan, Ziad Sehili, Peter Christen and Erhard Rahm
Abstract The growth of Big Data, especially personal data dispersed in multiple data
sources, presents enormous opportunities and insights for businesses to explore and
leverage the value of linked and integrated data. However, privacy concerns impede
sharing or exchanging data for linkage across different organizations. Privacy-
preserving record linkage (PPRL) aims to address this problem by identifying and
linking records that correspond to the same real-world entity across several data
sources held by different parties without revealing any sensitive information about
these entities. PPRL is increasingly being required in many real-world application
areas. Examples range from public health surveillance to crime and fraud detection,
and national security. PPRL for Big Data poses several challenges, with the three
major ones being (1) scalability to multiple large databases, due to their massive
volume and the flow of data within Big Data applications, (2) achieving high quality
results of the linkage in the presence of variety and veracity of Big Data, and (3)
preserving privacy and confidentiality of the entities represented in Big Data collec-
tions. In this chapter, we describe the challenges of PPRL in the context of Big Data,
survey existing techniques for PPRL, and provide directions for future research.
Keywords Record linkage · Privacy · Big data · Scalability
D. Vatsalan · P. Christen
Research School of Computer Science, The Australian National University,
Acton, ACT 2601, Australia
e-mail: dinusha.vatsalan@anu.edu.au
P. Christen
e-mail: peter.christen@anu.edu.au
Z. Sehili · E. Rahm (✉)
Database Group, University of Leipzig, 04109 Leipzig, Germany
e-mail: rahm@informatik.uni-leipzig.de
Z. Sehili
e-mail: sehili@informatik.uni-leipzig.de
© Springer International Publishing AG 2017
A.Y. Zomaya and S. Sakr (eds.), Handbook of Big Data Technologies,
DOI 10.1007/978-3-319-49340-4_25
1 Introduction
With the Big Data revolution, many organizations collect and process datasets that
contain many millions of records to analyze and mine interesting patterns and knowl-
edge in order to empower efficient and quality decision making [28,53]. Analyzing
and mining such large datasets often require data from multiple sources to be linked
and aggregated. Linking records from different data sources with the aim to improve
data quality or enrich data for further analysis is occurring in an increasing number of
application areas, such as healthcare, government services, crime and fraud detection,
national security, and business applications [28,52]. Effective ways of linking data
from different sources have also played an increasingly important role in generating
new insights for population informatics in the health and social sciences [99].
For example, linking health databases from different organizations facilitates qual-
ity health data mining and analytics in applications such as epidemiological studies
(outbreak detection of infectious diseases) or adverse drug reaction studies [20,116].
These applications require data from several organizations to be linked, for example
human health data, travel data, consumed drug data, and even animal health data [38].
Linked health databases can also be used for the development of health policies in
a more efficient and effective way compared to traditionally used time-consuming
survey studies [37,88].
Record linkage techniques are also being used by national security agencies and
crime investigators for effective identification of fraud, crime, or terrorism sus-
pects [73,125,168]. Such applications require data from law enforcement agencies,
immigration departments, Internet service providers, businesses, as well as financial
institutions [125].
In recent times, record linkage has increasingly been required by social scien-
tists in the field of population informatics to study insights into our society from
‘social genome’ data, the digital traces that contain person-level data about social
beings [99]. The ‘Beyond 2011’ program by the Office for National Statistics in the
UK, for example, has carried out research to study different possible approaches to
producing population and socio-demographics statistics for England and Wales by
linking data from several sources [121].
Record linkage within a single organization does not generally involve privacy
and confidentiality concerns (assuming there are no internal threats within the orga-
nization and the linked data are not being revealed outside the organization). An
example application is the deduplication of a customer database by a business
using record linkage techniques for conducting effective marketing activities. How-
ever, in many countries record linkage across several organizations, as required in
the above example applications, might not allow the exchange or the sharing of
database records between organizations due to laws or regulations. Some example
Acts that describe the legal restrictions of disclosing personal or sensitive data are:
(1) the Data-Matching Program Act in Australia,¹ (2) the European Union (EU) Per-
¹ https://www.oaic.gov.au/privacy-law/other-legislation/government-data-matching [Accessed:
15/06/2016].
sonal Data Protection Act in Europe,² and (3) the Health Insurance Portability and
Accountability Act (HIPAA) in the USA.³
The privacy requirements in the record linkage process have been addressed by
developing ‘privacy-preserving record linkage’ (PPRL) techniques, which aim to
identify matching records that refer to the same entities in different databases without
compromising privacy and confidentiality of these entities. In a PPRL project, the
database owners (or data custodians) agree to reveal only selected information about
records that have been classified as matches among each other, or to an external
party, such as a researcher [164]. However, record linkage requires access to the
actual values of certain attributes.
Known as quasi-identifiers (QIDs), these attributes need to be common in all
databases to be linked and represent identifying characteristics of entities to allow
matching of records. Examples of QIDs are first and last names, addresses, tele-
phone numbers, or dates of birth. Such QIDs often contain private and confidential
information of entities that cannot be revealed, and therefore the linkage has to be
conducted on masked (encoded) versions of the QIDs to preserve the privacy of
entities. Several masking techniques have been developed (as we will describe in
Sect. 3.4), using two different types of general approaches: (1) secure multi-party
computation (SMC) [111] and (2) data perturbation [87].
Leveraging the tremendous opportunities that Big Data can provide for businesses
comes with the challenges that PPRL poses, including scalability, quality, and pri-
vacy. Big Data implies enormous data volume as well as massive flows (velocity)
of data, leading to scalability challenges even with advanced computing technology.
The variety and veracity aspects of Big Data require biases, noise, variations and
abnormalities in data to be considered, which makes the linkage process more chal-
lenging. With Big Data containing massive amounts of personal data, linking and
mining data may breach the privacy of those represented by the data. A practical
PPRL solution that can be used in real-world applications should therefore address
these challenges of scalability, linkage quality, and privacy. A variety of PPRL tech-
niques has been developed over the past two decades, as surveyed in [154,164].
However, these existing approaches for PPRL fall short in providing a sound solu-
tion in the Big Data era by not addressing all of the Big Data challenges. Therefore,
more research is required to leverage the huge potential that linking databases in
the era of Big Data can provide for businesses, government agencies, and research
organizations.
In this chapter, we review the existing challenges and techniques, and discuss
research directions of PPRL for Big Data. We provide the preliminaries in Sect. 2
and review existing privacy techniques for PPRL in Sect. 3. We then discuss the
scalability challenge and existing approaches that address scalability of PPRL in
Sect. 4. In Sect. 5, we describe the challenges and existing techniques of PPRL on
multiple databases, which is an emerging research avenue that is being increasingly
required in many Big Data applications. In Sect. 6 we discuss research directions in
² http://ec.europa.eu/justice/data-protection/index_en.htm [Accessed: 15/06/2016].
³ http://www.hhs.gov/ocr/privacy/ [Accessed: 15/06/2016].
PPRL for Big Data, and in Sect. 7 we conclude this chapter with a brief summary of
the topic covered.
2 Background
Building on the introduction to record linkage and privacy-preserving record link-
age (PPRL) in Sect. 1, we now present background material that contributes to the
understanding of the preliminaries. We describe the basic concepts and challenges
in Sect. 2.1, and then describe the process of PPRL in Sect. 2.2.
2.1 Overview and Challenges of PPRL
Record linkage is a widely used data pre-processing and data cleaning task where the
aim is to link and integrate records that refer to the same entity from two or multiple
disparate databases. The record pairs (when linking two databases) or record sets
(when linking more than two databases) are compared and classified as ‘matches’ by
a linkage model if they are assumed to refer to the same entity, or as ‘non-matches’
if they are assumed to refer to different entities [26,54]. The frequent absence of
unique entity identifiers across the databases to be linked makes it impossible to use
a simple SQL join [30], and therefore linkage requires sophisticated comparisons
between a set of QIDs (such as names and addresses) that are commonly available
in the records to be linked. However, these QIDs often contain personal information
and therefore revealing or exchanging them for linkage is not possible due to privacy
and confidentiality concerns.
As an example scenario, assume a demographer who aims to investigate how
mortgage stress (having to pay large sums of money on a regular basis to pay off
a house) is affecting people with regard to their mental and physical health. This
research will require data from financial institutions as well as hospitals as shown
in Tables 1 and 2. Neither of these organizations is likely to be willing, or allowed by law,
to provide their databases to the researcher. The researcher only requires access to
some attributes of the records (such as loan type, balance amount, blood pressure,
and stress level) that are linked across these databases, but not the actual identities
of the individuals that were linked. However, personal details (such as name, age or
date of birth, gender, and address) are needed as QIDs to conduct the linkage due to
the absence of unique identifiers across the databases.
As illustrated in the above example application (shown in Tables 1 and 2), link-
ing records in a privacy-preserving context is important, as sharing or exchanging
sensitive and confidential personal data (contained in QIDs of records) between
organizations is often not feasible due to privacy concerns, legal restrictions, or
commercial interests. Therefore, databases need to be linked in such ways that no
sensitive information is being revealed to any of the organizations involved in a cross-
Table 1 Example bank database

ID    Given_name  Surname  DOB       Gender  Address                Loan_type  Balance
6723  Peter       Robert   20.06.72  M       16 Main Street 2617    Mortgage   230,000
8345  Smith       Roberts  11.10.79  M       645 Reader Ave 2602    Personal   8,100
9241  Amelia      Millar   06.01.74  F       49 Applecross Rd 2415  Mortgage   320,750

Table 2 Example health database

PID    Last_name  First_name  Age  Address                  Sex  Pressure  Stress  Reason
P1209  Roberts    Peter       41   16 Main St 2617          m    140/90    High    Chest pain
P4204  Miller     Amelia      39   49 Applecross Road 2415  f    120/80    High    Headache
P4894  Sieman     Jeff        30   123 Norcross Blvd 2602   m    110/80    Normal  Checkup
organizational linkage project, and no adversary is able to learn anything about these
sensitive data. This problem has been addressed by the emerging research area of
PPRL [164].
The basic ideas of PPRL techniques are to mask (encode) the databases at their
sources and to conduct the linkage using only these masked data. This means no
sensitive data are ever exchanged between the organizations involved in a PPRL
protocol, or revealed to any other party. At the end of such a PPRL process, the
database owners only learn which of their own records match with a high similar-
ity with records from the other database(s). The next steps would be exchanging
the values in certain attributes of the matched records (such as loan type, balance
amount, blood pressure, and stress level in the above example) between the database
owners, or sending selected attribute values to a third party, such as a researcher
who requires the linked data for their project [164]. Recent research outcomes and
experiments conducted in real health data linkage validate that PPRL can achieve
linkage quality with only a small loss compared to traditional record linkage using
unencoded QIDs [134,135].
Using PPRL for Big Data involves many challenges, among which the follow-
ing three key challenges need to be addressed to make PPRL viable for Big Data
applications:
1. Scalability: The number of comparisons required for classifying record pairs or
sets equals the product of the sizes of the databases that are linked. This is a
performance bottleneck in the record linkage process since it potentially requires
comparison of all record pairs/sets using expensive comparison functions [9,
31]. Due to the increasing size of Big Data (volume), comparing all records is
not feasible in most real-world applications. Blocking and filtering techniques
have been used to overcome this challenge by eliminating as many comparisons
between non-matching records as possible [9,29,150].
2. Linkage quality: The emergence of Big Data brings with it the challenge of deal-
ing with typographical errors and other variations in data (variety and veracity)
making the linkage more challenging. The exact matching of QID values, which
would classify pairs or sets of records as matches if their QIDs are exactly the
same and as non-matches otherwise, will likely lead to low linkage accuracy in
the presence of real-world data errors. In addition, the classification models used
in record linkage should be effective and accurate in classifying matches and non-
matches [31]. Therefore, for practical record linkage applications, techniques are
required that facilitate both approximate matching of QID values for comparison,
as well as effective classification of record pairs/sets for high linkage accuracy.
3. Privacy: The privacy-preserving requirement in the record linkage process adds
a third challenge, privacy, to the two main challenges of scalability and linkage
quality [164]. Linking Big Data containing massive amounts of personal data
generally involves privacy and confidentiality issues. Privacy needs to be consid-
ered in all steps of the record linkage process as only the masked (or encoded)
records can be used, making the task of linking databases across organizations
more challenging. Several masking techniques have been used for PPRL, as we
will discuss in detail in Sect. 3.4.
2.2 The PPRL Process and Techniques Used
In this section we discuss the steps and the techniques used in the PPRL process, as
shown in Fig. 1.
Data Pre-processing and Masking: The first important step for quality linkage
outcomes is data pre-processing. Real-world data are often noisy, incomplete and
inconsistent [8,128], and they need to be cleaned in this step by filling in missing
data, removing unwanted values, transforming data into well-defined and consistent
forms, and resolving inconsistencies in data representations and encodings [28,36].
In PPRL, data masking (encoding) is an additional step. Data pre-processing and
masking can be conducted independently at each data source. However, it is important
that all database owners (or parties) who participate in a PPRL project conduct the
same data pre-processing and masking steps on the data they will use for linking.
Some exchange of information between the parties about what data pre-processing
Fig. 1 Outline of the general PPRL process as discussed in Sect. 2.2. The steps shown in dark
outlined boxes need to be conducted on masked database records, while dotted arrows show
alternative data flows between steps
and masking approaches they use, as well as which attributes are to be used as QIDs, is
therefore required [164].
Blocking/filtering: Blocking/filtering is the second step of the process, which is
aimed at reducing the number of comparisons that need to be conducted between
records by pruning as many pairs or sets of records as possible that are unlikely to
correspond to matches [9,29]. Blocking groups records according to a blocking criterion
(blocking key) such that comparisons are limited to records in the same (or similar)
blocks, while filtering prunes potential non-matches based on their properties (e.g.
length differences of QIDs) [29]. The output of this step is a set of candidate record pairs
(or sets) that contain records that are potentially matching, which need to be com-
pared in more detail. Blocking/filtering can either be conducted on masked records
or locally by the database owners on unmasked records. The scalability challenge of
PPRL has been addressed by several recent approaches using private blocking and
filtering techniques [46,78,131,133,149,150,159,163], as will be described in
Sects. 4 and 5.1.
Comparison: Candidate record pairs (or sets) are compared in detail in the com-
parison step using comparison (or similarity) functions [32]. Various comparison
functions have been used in record linkage including Levenshtein edit distance,
Jaro-Winkler comparison, Soft-TFIDF string comparison, and token-based compar-
ison using the Overlap, Dice, or Jaccard coefficient [28]. These comparison functions
provide a numerical value representing the similarity of the compared QID values,
often normalized into the [0, 1] interval where a similarity of 1 corresponds to two
values being exactly the same, and 0 means two values being completely different.
Several QIDs are normally used when comparing records, resulting in one weight
vector for each compared record pair that contains the numerical similarity values
of all compared QIDs.
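As an illustration of the comparison step, the following minimal Python sketch computes q-gram based Dice-coefficient similarities of the QID values of two records and collects them into a weight vector. The attribute names and values are illustrative only and not taken from any specific PPRL system.

```python
def qgrams(value, q=2):
    """Return the set of character q-grams (sub-strings of length q)."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def dice_sim(val1, val2, q=2):
    """Dice-coefficient similarity of two strings, normalized into [0, 1]."""
    s1, s2 = qgrams(val1, q), qgrams(val2, q)
    if not s1 or not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

# One weight vector per compared record pair, one similarity per QID
record_a = {'first_name': 'peter', 'surname': 'robert'}
record_b = {'first_name': 'peter', 'surname': 'roberts'}
weight_vector = [dice_sim(record_a[qid], record_b[qid])
                 for qid in ('first_name', 'surname')]
print(weight_vector)  # [1.0, 0.909...]
```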
The QIDs of records often contain variations and errors, and therefore simply
masking these values with a secure one-way hash-encoding function (as will be
described in Sect. 3.4) and comparing the masked values will not result in high linkage
quality for PPRL [35,122]. A small variation in a pair of QID values will lead to
completely different hash-encoded values [35], which enables only exactly matching
QID values to be identified with such a simple masking approach. Therefore, an
effective masking approach for securely and accurately calculating the approximate
similarity of QID values is required. Several approximate comparison functions have
been adapted into a PPRL context, including the Levenshtein edit distance [76] and
the Overlap, Dice, and Jaccard coefficients [164].
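The following small Python sketch illustrates this limitation; SHA-256 is used here only as an illustrative one-way hash function. A single character difference between two QID values yields two completely different hash-codes, so only exactly matching values can be identified on the masked data:

```python
import hashlib

# 'smith' and 'smyth' differ in one character, but their masked values
# share no exploitable similarity, so approximate matching is impossible
for name in ('smith', 'smyth'):
    print(name, hashlib.sha256(name.encode()).hexdigest()[:16])
```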
Classification: In the classification step, the weight vectors of the compared candi-
date record pairs (or sets) are given as input to a decision model which will classify
them into matches, non-matches, and possible matches [31,54,63], where the latter
class is for cases where the classification model cannot make a clear decision. A
widely used classification approach for record linkage is the probabilistic method
developed by Fellegi and Sunter in the 1960s [54]. In this model, the likelihood that
a pair (or set) of records corresponds to a match or a non-match is modelled based
on a-priori error estimates on the data, frequency distributions of QID values, as well
as their similarity calculated in the comparison step [28]. Other classification tech-
niques include simple threshold-based and rule-based classifiers [28]. Most PPRL
techniques developed so far employ a simple threshold-based classification [164].
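A minimal Python sketch of such a simple threshold-based classifier is given below; the use of the average similarity of a weight vector and the two threshold values are illustrative assumptions rather than settings from a particular PPRL system:

```python
def classify(weight_vector, t_upper=0.85, t_lower=0.65):
    """Classify a compared record pair based on its weight vector."""
    avg_sim = sum(weight_vector) / len(weight_vector)
    if avg_sim >= t_upper:
        return 'match'
    if avg_sim <= t_lower:
        return 'non-match'
    return 'possible match'  # left for clerical review

print(classify([1.0, 0.91]))   # match
print(classify([0.7, 0.75]))   # possible match
```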
Supervised machine learning approaches, such as support vector machines and
decision trees [14,25,51,52], can be used for more effective and accurate classifica-
tion results. These require training data with known class labels for matches and non-
matches to train the decision model. Once trained, the model can be used to classify
the remaining unlabelled pairs/sets of records. Such training data, however, are often
not available in real record linkage applications, especially in privacy-preserving set-
tings [28]. Alternatively, semi-supervised techniques (such as active learning-based
techniques [3,11,169]), that actively use examples manually labeled by experts
to train and improve the decision model, need to be developed for PPRL. Recently
developed advanced classification models, such as (a) collective linkage [13,74] that
considers relational similarities with other records in addition to QID similarities, (b)
group linkage [123] that calculates similarities between groups of records based on
pair-wise similarities, and (c) graph-based linkage [58,66,74] that considers the structure
between groups of records, can achieve high linkage quality at the cost of higher
computational complexity. However, these advanced classification techniques have
not been explored for PPRL so far.
Clerical review: The record pairs/sets that are classified as possible matches require
a clerical review process, where these pairs are manually assessed and classified into
matches or non-matches [171]. This is usually a time-consuming and error-prone
process which depends upon the experience of the experts who conduct the review. Active
learning-based approaches can be used for clerical review [3,11,169]. However,
clerical review in its current form is not possible in a PPRL scenario since the actual
QID values of records cannot be inspected because this would reveal sensitive private
information. Recent work in PPRL suggests an interactive approach with human-
machine interaction to improve the quality of linkage results without compromising
privacy [100].
Evaluation: The final step in the process is the evaluation of the complexity, quality,
and privacy of the linkage to measure the applicability of a linkage project in an
application before implementing it into an operational system. A variety of evaluation
measures have been proposed [29,31]. Given that in a practical record linkage application
the true match status of the compared record pairs is unlikely to be known, measuring
linkage quality is difficult [6,31]. How to evaluate the amount of privacy protection
using a set of standard measures is still an immature aspect in the PPRL literature.
Vatsalan et al. recently proposed a set of evaluation measures for privacy using
probability of suspicion [165]. Entropy and information gain, between unmasked
and masked QID values, have also been used as privacy measures [46].
Tools: Different record linkage approaches have been implemented within a number
of tools. Koepcke and Rahm provided a detailed overview of eleven such tools
in [95], covering both categories of tools with and without the use of learning-based
(supervised) classification. The comparative evaluation study [96] benchmarks selected
tools from both categories on four real-life test cases. It found that learning-based
approaches generally achieve better linkage quality, especially for complex tasks
requiring the combination of several attribute similarities. Current tools for link
discovery, i.e., matching entities between sources on the linked open data web, are
surveyed in [119]. A web-based tool was recently developed to demonstrate several
multi-party PPRL approaches (as will be described in Sect. 5) [132].
3 Privacy Aspects and Techniques for PPRL
Several dimensions of privacy need to be considered for PPRL, the four main ones
being: (1) the number of parties and their roles, (2) adversary models, (3) privacy
attacks, and (4) data masking or encoding techniques. In Sects. 3.1–3.4, we describe
these four privacy dimensions, and in Sect. 3.5 we provide an overview of Bloom
filter-based data masking, a technique widely used in PPRL.
3.1 PPRL Scenarios
PPRL techniques for linking two databases can be classified into those that require a
linkage unit for performing the linkage and those that do not. The former are known
as ‘three-party protocols’ and the latter as ‘two-party protocols’ [24,27,167]. In
three-party protocols, a (trusted) third party acts as the linkage unit to conduct the
linkage of masked data received from the two database owners, while in two-party
Fig. 2 Outline of PPRL protocols with (left) and without (right) a linkage unit (also known as
three-party and two-party protocols, respectively)
protocols only the two database owners participate in the PPRL process. A conceptual
illustration and the main communication steps involved in these protocols are shown
in Fig. 2.
A further characterization of PPRL techniques is whether they allow the linking of data
from more than two data sources (multi-party) or not. Multi-party PPRL techniques
identify matching record sets (instead of record pairs) from all parties (more than
two) involved in a linkage, or from sub-sets of parties. Only limited work has so
far been conducted on multi-party PPRL due to its increased challenges, as we will
describe in Sect. 5. Similar to linking two databases, multi-party PPRL may or may
not use a linkage unit to perform the linkage.
Protocols that do not require a linkage unit are more secure in terms of collusion
(described below) between one of the database owners and the linkage unit. However,
they generally require more complex techniques to ensure that the database owners
cannot infer any sensitive information about each other’s data during the linkage
process.
3.2 Adversary Models
Different adversary models are assumed in PPRL techniques, including the most
commonly used honest-but-curious (HBC) and malicious models [164].
1. Honest-but-curious (HBC) or semi-honest parties are curious in that they try
to find out as much as possible about another party’s input to a protocol while
following the protocol steps [65,111]. If all parties involved in a PPRL protocol
have no new knowledge at the end of the protocol above what they would have
learned from the output, which is generally the record pairs (certain attributes)
classified as matches, then the protocol is considered to be secure in the HBC
model. However, it is important to note that HBC does not prevent parties from
colluding with each other with the aim to learn about another party’s sensitive
information [111]. Most of the existing PPRL solutions assume the HBC adver-
sary model.
2. Malicious parties can behave arbitrarily in terms of refusing to participate in
a protocol, not following the protocol in the specified way, choosing arbitrary
values for their data input, or aborting the protocol at any time [110]. Limited
work has been done in PPRL for the malicious adversary model [57,105,118].
Evaluating privacy under this model is very difficult, because there exist poten-
tially unpredictable ways for malicious parties to deviate from the protocol that
cannot be identified by an observer [21,62,111].
3. Covert and accountable computing models are advanced adversary models
developed to overcome the problems associated with the HBC and malicious
models. The HBC model is not sufficient in many real-world applications because
it is suitable only when the parties essentially trust each other. On the other hand,
the solutions that can be used with malicious adversaries are generally more com-
plex and have high computation and communication complexities, making their
applications not scalable to large databases. The covert model guarantees that
the honest parties can detect the misbehavior of an adversary with high prob-
ability [4], while the accountable computing model provides accountability for
privacy compromises by the adversaries without the excessive complexity and cost
incurred with the malicious model [72]. Research is required towards transform-
ing existing HBC or malicious PPRL protocols into these models and proving
privacy of solutions under these models.
3.3 Attacks
The privacy attacks or vulnerabilities that a PPRL technique is susceptible to allow
theoretical and empirical analysis of the privacy guarantees provided by the PPRL
technique. The main privacy attacks of PPRL are:
1. Dictionary attack is possible with masking functions, where an adversary masks
a large list of known values using various existing masking functions until a
matching masked value is identified [164]. A keyed masking approach, such as
the Hashed Message Authentication Code (HMAC) [97], can be used to prevent
dictionary attacks. With HMAC the database owners exchange a secret code
(string) that is added to all database values before they are masked. Without
knowing the secret key, a dictionary attack is unlikely to be successful (see the
code sketch after this list).
2. Frequency attack is still possible even with a keyed masking approach [164],
where the frequency distribution of a set of masked values is matched with the
distribution of known unmasked values in order to infer the original values of the
masked values [112].
3. Cryptanalysis attack is a special category of frequency attack that is applicable
to Bloom filter-based data masking techniques. As Kuzu et al. [101] have shown,
depending upon certain parameters of Bloom filter masking, such as the number
of hash functions employed and the number of bits in a Bloom filter, using a
constraint satisfaction solver allows the iterative mapping of individual masked
values back to their original values.
4. Composition attack can be successful by combining knowledge from more than
one independent masked dataset to learn sensitive values of certain records [60].
An attack on distance-preserving perturbation techniques [155], for example,
allows the original values to be re-identified with a high level of confidence if
knowledge about mutual distances between values is available.
5. Collusion is another vulnerability associated with multi-party or three-party
PPRL techniques, where some of the parties involved in the protocol work together
to find out about another database owner’s data. For example, one or several data-
base owners collude with the linkage unit, or a sub-set of database owners collude
among themselves to learn about other parties’ data.
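To make the keyed masking mentioned under the dictionary attack above concrete, the following Python sketch masks QID values with HMAC-SHA256. The key value is a placeholder; in practice the secret key is exchanged only between the database owners, so an adversary cannot reproduce the masked values for a dictionary of known values:

```python
import hashlib
import hmac

SECRET_KEY = b'shared-secret-key'  # placeholder, known only to the database owners

def mask(value):
    """Keyed one-way masking of a QID value using HMAC-SHA256."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(mask('peter'))  # reproducible only with knowledge of the secret key
```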
3.4 Data Masking or Encoding
In PPRL, the linkage has to be conducted on a masked or encoded version of the QIDs
to preserve the privacy of entities. Data masking (encoding) transforms original data
in such a way that there exists a specific functional relationship between the original
data and the masked data [55]. Several data masking functions have been used to
preserve privacy while allowing the linkage. We categorize existing data masking
techniques into three: (1) auxiliary, (2) blocking, and (3) matching techniques. Aux-
iliary techniques are the ones used as helper functions in PPRL, while blocking and
matching categories are used for private blocking and matching (comparison and
classification), respectively. In the following we describe key techniques in each of
these three categories.
Auxiliary:
1. Pseudo random function (PRF) is a deterministic secure function that, when
given an n-bit seed k and an n-bit argument x, returns an n-bit string fk(x) such
that it is infeasible to distinguish fk(x) for different random k from a truly random
function [114]. In PPRL, PRFs have been used to generate random secret values
to be shared by a group of parties [57,122,151].
2. Reference values constructed either with random faked values, or values that for
example are taken from a public telephone directory, such as all unique surnames
and town names, have been used in several PPRL approaches [77,124,141,173].
Such lists of reference values can be used to calculate the distances or similarities
between QID values in terms of the distances or similarities between QID and
reference values.
3. Noise addition in the form of extra records or QID values that are added to the
databases to be linked is a data perturbation technique [86] that can be used to
overcome the problem of frequency attacks on PPRL protocols [44,100]. An
example is shown in Fig. 3. Adding extra records, however, incurs a cost of lower
linkage quality (due to false matches) and scalability (due to extra records that
need to be processed and linked) [79]. Section 4.1 discusses noise addition for
private blocking.
Fig. 3 An example of phonetic encoding, noise addition, and secure hash-encoding (adapted
from [164]). Values represented with dotted outlines are added noise to overcome frequency
attacks [165] (as will be discussed in detail in Sect. 4.1)
4. Differential privacy [50] has emerged as an alternative to random noise addition
in PPRL. Only the perturbed results (with noise) of a set of statistical queries
are disclosed to other parties, such that the probability of holding any property
on the results is approximately the same whether or not an individual value is
present in the database [50]. In recent times, differential privacy has been used in
statistical (e.g. counts or frequencies) microdata publication as well as in PPRL
[16,68,103].
Blocking:
1. Phonetic encoding, such as Soundex, NYSIIS or Double-Metaphone, groups
values together that have a similar pronunciation [23] in a one-to-many mapping,
as shown in Fig. 3. The main advantage of using a phonetic encoding for PPRL
is that it inherently provides privacy [79], reduces the number of comparisons,
and thus increases scalability [23], and supports approximate matching [23,79].
Two drawbacks of phonetic encodings are that they are language dependent
[126,146] and are vulnerable to frequency attacks [165]. Section 4.1 discusses
phonetic encoding-based blocking in more details.
2. Generalization techniques overcome the problem of frequency attacks on
records by generalizing records in such a way that re-identification of individual
records from the masked data is not feasible [106,115,152]. k-anonymity is a
widely used generalization technique for PPRL [75,77,118], where a database is
said to be k-anonymous if every combination of masked QID values (or blocking
key values) is shared by at least k records in the database [152]. Other gen-
eralization techniques include value generalization hierarchies [67], top-down
specialization [118], and binning [108,162].
Matching:
1. Secure hash-encoding is one of the first techniques used for PPRL [17,49,127].
The widely known Message Digest (like MD5) and Secure Hash Algorithms (like
SHA-1 and SHA-2) [143] are one-way hash-encoding functions [97,143] that
can be used to mask values into hash-codes (as shown in Fig. 3) such that having
access to only hash-codes will make it nearly impossible with current computing
technology to infer their original values. A major problem with this masking
technique is, however, that only exact matches can be found [49]. Even a single
864 D. Vatsalan et al.
character difference between a pair of original values will result in two completely
different hash-codes.
2. Statistical linkage key (SLK) is a derived variable generated from components
of QIDs. The SLK-581 was developed by the Australian Institute of Health
and Welfare (AIHW) to link records from the Home and Community Care
datasets [140]. The SLK-581 consists of the second and third letters of first
name, the second, third and fifth letters of surname, full date of birth, and sex.
Similarly, an SLK consisting of month and year of birth, sex, zipcode, and initial of
first name was used for linking the Belgian national cancer registry [157]. How-
ever, as a recent study has shown, such SLK-based masking provides limited
privacy protection and poor sensitivity [136] (see the code sketch after this list).
3. Embedding space embeds QID values into a multi-dimensional metric space
(such as Euclidean [16,141,173] or Hamming [81]) while preserving the dis-
tances between these values using a set of pivot values that span the multi-
dimensional space. A drawback with this approach is that it is often difficult to
determine a good dimension of the metric space and select suitable pivot values.
4. Encryption schemes, such as commutative [1] and homomorphic [92] encryp-
tion, are used in PPRL techniques to allow secure multi-party computation (SMC)
in such a way that at the end of the computation no party knows anything except
its own input and the final results of the computation [38,62,111]. The secure
set union, secure set intersection, and secure scalar product, are the most com-
monly used SMC techniques for PPRL [38,143]. A drawback of these crypto-
graphic encryption schemes for SMC, however, is that they are computationally
expensive.
5. Bloom filter is a bit vector data structure into which values are mapped by using
a set of hash functions. Bloom filters have been widely used in PPRL for private
matching of records as they provide a means of privacy assurance [46,47,76,
104,147,170], if effectively used [102]. We will discuss Bloom filter masking
in more detail in the following section.
6. Count-min sketches are probabilistic data structures (similar to Bloom filters)
that can be used to hash-map values along with their frequencies in a sub-linear
space [41]. Count-min sketches have been used in PPRL where the frequency of
occurrences of a matching pair/set also needs to be identified [80,139]. However,
these approaches only support exact matching of categorical values.
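The SLK-581 generation described in the statistical linkage key item above can be sketched in Python as follows. The concatenation order, the date format, and the padding of short names with '2' are our own assumptions and may differ from the official AIHW specification:

```python
def slk581(first_name, surname, dob, sex):
    """Derive a statistical linkage key from QID components."""
    def pick(name, positions):
        # positions are 1-based; short names are padded with '2' (assumed rule)
        return ''.join(name[p - 1] if p <= len(name) else '2'
                       for p in positions)
    return (pick(first_name.upper(), [2, 3]) +  # 2nd, 3rd letters of first name
            pick(surname.upper(), [2, 3, 5]) +  # 2nd, 3rd, 5th letters of surname
            dob + sex)                          # full date of birth and sex

print(slk581('Amelia', 'Millar', '06011974', 'F'))  # MEILA06011974F
```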
Other privacy aspects in a PPRL project are the secure generation and exchange
of public/private key pairs, employee confidentiality agreements to reduce internal
threats, as well as encrypted communication, secure connections, and secure servers
to reduce external threats.
3.5 Bloom Filters
Bloom filter encoding has been used as an efficient masking technique in a variety
of PPRL approaches [46,104,130,148,150,158,160]. A Bloom filter bi is a
bit vector data structure of length l bits where all bits are initially set to 0. k inde-
pendent hash functions, h1, h2, ..., hk, each with range 1, ..., l, are used to map
each of the elements s in a set S into the Bloom filter by setting the bit positions
h1(s), h2(s), ..., hk(s) to 1. The Bloom filter was originally proposed by Bloom [15]
for efficiently checking set membership [19]. Lai et al. [104] first adopted the con-
cept of using Bloom filters in PPRL for identifying exactly matching records across
multiple databases, as will be described in Sect. 5.2.
Schnell et al. [148] were the first to propose a method for approximate matching
in PPRL using Bloom filters. In their approach, the character q-grams (sub-strings of
length q) of QID values of each record in the databases to be linked are hash-mapped
into a Bloom filter using k independent hash functions. The resulting Bloom filters
are sent to a linkage unit that calculates the similarity between Bloom filters using a
set-based similarity function, such as the Dice-coefficient [28]. The Dice-coefficient
of two Bloom filters (b1 and b2) is calculated as:

    Dice_sim(b1, b2) = 2 × c / (x1 + x2),    (1)

where c is the number of common bit positions that are set to 1 in both Bloom filters
(common 1-bits), and xi is the number of bit positions set to 1 in bi (1-bits), i ∈ {1, 2}.
An example similarity calculation is illustrated in Fig. 4.
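The following Python sketch illustrates Schnell et al.'s q-gram based Bloom filter masking and the Dice-coefficient of Eq. (1) on a small scale. The Bloom filter length l = 30, the k = 2 hash functions, and the seeded SHA-256 hashing scheme are illustrative choices only, not the parameters of any published system:

```python
import hashlib

L_BITS = 30   # Bloom filter length l (illustrative)
K_HASH = 2    # number of hash functions k (illustrative)

def bloom_filter(value, q=2):
    """Hash-map the q-grams of a QID value into a Bloom filter."""
    bf = [0] * L_BITS
    for i in range(len(value) - q + 1):
        gram = value[i:i + q]
        for seed in range(K_HASH):
            digest = hashlib.sha256(f'{seed}:{gram}'.encode()).hexdigest()
            bf[int(digest, 16) % L_BITS] = 1
    return bf

def dice_sim(b1, b2):
    """Dice-coefficient of two Bloom filters, as in Eq. (1)."""
    c = sum(x & y for x, y in zip(b1, b2))  # number of common 1-bits
    x1, x2 = sum(b1), sum(b2)
    return 2 * c / (x1 + x2) if (x1 + x2) > 0 else 0.0

print(dice_sim(bloom_filter('peter'), bloom_filter('pete')))  # high similarity
```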
Bloom filters are susceptible to cryptanalysis attacks, as shown by Kuzu et
al. [101]. Using a constraint satisfaction solver, such attacks allow the iterative
mapping of individual hash-encoded values back to their original values depending
upon the number of hash functions employed and the length of a Bloom filter. Differ-
ent Bloom filter encoding methods have been proposed in the literature to overcome
such cryptanalysis attacks and improve linkage quality. Schnell et al.’s proposed
method of hash-mapping all QID values of a record into one composite Bloom filter
is known as Cryptographic Long-term Key (CLK) encoding [148].
Fig. 4 An example similarity (Dice-coefficient) calculation of Bloom filters for approximate
matching using Schnell et al.’s approach [147] (taken from [164])
Durham et al. [48] investigated composite Bloom filter encoding in detail by
first hash-mapping different attributes into attribute-level Bloom filters of differ-
ent lengths. These lengths depend upon the weights [54] of QID attributes that are
calculated using the discriminatory power of attributes in separating matches from
non-matches using a statistical approach. These attribute-level Bloom filters are then
combined into one record-level Bloom filter (known as RBF) by sampling bits from
each attribute-level Bloom filter. Vatsalan et al. [165] recently introduced a hybrid
method of CLK and RBF (known as CLKRBF) where the Bloom filter length is kept
the same (as in CLK) while using different numbers of hash functions to map
different attributes into the Bloom filter based on their weights (to improve matching
quality as in RBF).
Several non-linkage unit-based approaches have also been proposed for PPRL
using Bloom filter masking, where the database owners (without a linkage unit)
collaboratively (or distributively) calculate the similarity of Bloom filters [104,158,
160]. A recent work proposed novel Bloom filter-based masking techniques that allow
approximate matching of numerical data in PPRL [161]. Instead of hash-mapping q-
grams of a string, the proposed approaches hash-map a set of neighbouring numerical
values to allow approximate matching.
4 Scalability Techniques for PPRL
PPRL for Big Data needs to scale to very large data volumes of many millions of
records from multiple sources. As for standard record linkage, the main techniques
for high efficiency are to reduce the search space by blocking and filtering approaches
and to perform record linkage in parallel on many processors.
These three kinds of optimization are largely orthogonal so that they may be
combined to achieve maximal efficiency. Blocking is defined on selected attributes
(blocking keys) that may be different from the QID attributes used for comparison, for
example zip code. It partitions the records in a database into several blocks or clusters
such that comparison can be restricted to the records of the same block, for example
persons with the same zip code. Other blocking approaches like sorted neighborhood
work differently but are similar in spirit. Filtering is an optimization for the particular
comparison approach which optimizes the evaluation of a specific similarity measure
for a predefined similarity threshold to be met by matching records. It thus utilizes
different filtering or indexing techniques to eliminate pairs (or sets) of records that
cannot meet the similarity threshold for the selected similarity measures [28,43].
Such techniques can be applied for comparison within blocks, i.e., filtering could be
utilized in addition to blocking.
In the next two subsections, we discuss several proposed blocking and filtering
approaches for PPRL. We then briefly discuss parallel PPRL which has found only
limited attention so far. We will focus on PPRL with two data sources (multi-party
PPRL is discussed in Sect. 5). We will furthermore mostly assume the use of a
dedicated linkage unit (as shown on the left-hand side in Fig. 2) as well as the masking
of records using Bloom filters (as described in Sect. 3.5). Note that a linkage unit is
ideally suited for high performance as it requires minimal communication between
the database owners, and it can utilize a high performance cluster for parallel PPRL
as well as blocking and filtering techniques.
4.1 Blocking Techniques
Blocking aims at reducing the search space for linkage by avoiding the comparison
between every pair of records and its associated quadratic complexity. There are
numerous blocking strategies [28] that mostly group records into disjoint or overlap-
ping blocks such that only records within the same block need to be compared with
each other. Standard blocking uses the values of a so-called blocking key to partition
all records into disjoint blocks. The blocking key values (BKVs) may be the values
of a selected attribute or the result of a function on one or several attribute values
(e.g. the concatenation of the first two letters of last name and year of birth). Other
blocking approaches include canopy clustering that results in overlapping clusters
or blocks, and sorted neighborhood that sorts records according to a sorting key
and only compares neighboring records within a certain window [28]. Comparing
records only within the predetermined blocks may result in the loss of some matches
especially if some BKVs are incorrect or missing. To improve recall a multi-pass
blocking approach can be utilized, where records are blocked according to different
blocking keys, at the cost of an increased number of additional comparisons.
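A minimal Python sketch of standard blocking using the example blocking key mentioned above (the concatenation of the first two letters of last name and year of birth); the record layout and values are illustrative assumptions:

```python
from collections import defaultdict

def blocking_key(record):
    """BKV: first two letters of last name concatenated with year of birth."""
    return record['surname'][:2].lower() + record['dob'][-4:]

def build_blocks(records):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks  # only records within the same block are compared

records = [{'surname': 'Millar', 'dob': '06.01.1974'},
           {'surname': 'Miller', 'dob': '12.03.1974'},
           {'surname': 'Roberts', 'dob': '11.10.1979'}]
print({bkv: len(block) for bkv, block in build_blocks(records).items()})
# {'mi1974': 2, 'ro1979': 1}
```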
Blocking for PPRL is based on the known approaches for regular record linkage
but aims at improving privacy. A general approach with a central linkage unit is to
apply a previously agreed on blocking strategy by the database owners on the original
records. Then all records within the different blocks are masked (encoded), e.g. using
Bloom filters, and sent to the linkage unit. The linkage unit can then compare the
masked records block-wise with each other. In the following, we present selected
blocking approaches for PPRL and discuss results from a comparative evaluation of
different blocking schemes.
Phonetic Blocking: Blocking records based on their phonetic code is a widely used
technique in record linkage [28]. The basic idea is to encode the values of a blocking
key attribute (e.g. last name) with a phonetic function (as discussed in Sect.3.4) such
as Soundex or Metaphone [28]. All records with the same phonetic code, i.e. with a
similar pronunciation, are then assigned to the same block. The phonetic blocking
has been used in several PPRL approaches, in particular in [76,79]. Karakasidis et
al. in [76] use a multi-pass approach with both Soundex and Metaphone encodings to
achieve a good recall. Furthermore, they add fake records to the blocks for improved
privacy.
As discussed in Sect. 3.4, adding fake records improves privacy but adds overhead
in the form of extra comparisons between records and can reduce linkage quality due
to the introduction of false matches. A theoretical analysis of the impact of adding
fake records for Soundex-based blocking is presented in [79]. The authors study the
effect of fake records on the so-called relative information gain which is related to the
entropy measure. A high entropy within blocks caused by fake records introduces
a high uncertainty and thus a low information gain [165]. The authors also study
different methods to inject fake records to increase entropy. The most flexible of the
approaches is based on the concept of k-anonymity and adds as many fake records
as required to ensure that each block has at least k records. The approach typically
requires only the addition of a modest number of fake records; the choice of k also
supports finding a good trade-off between privacy and quality.
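The following Python sketch combines Soundex-based phonetic blocking with the k-anonymity style padding of small blocks by fake records discussed above. The Soundex variant is simplified, and the fake record generator is a stand-in for a realistic one:

```python
from collections import defaultdict

CODES = {c: d for d, chars in
         {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
          '4': 'l', '5': 'mn', '6': 'r'}.items() for c in chars}

def soundex(name):
    """Simplified Soundex: first letter plus up to three code digits."""
    name = name.lower()
    code, last = name[0].upper(), CODES.get(name[0], '')
    for ch in name[1:]:
        digit = CODES.get(ch, '')
        if digit and digit != last:
            code += digit
        if ch not in 'hw':  # h and w do not separate equal codes
            last = digit
    return (code + '000')[:4]

def phonetic_blocks(names, k=3):
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    for key, block in blocks.items():  # pad with fake records for privacy
        while len(block) < k:
            block.append('fake-' + key)
    return blocks

print(phonetic_blocks(['millar', 'myler', 'peter', 'smith', 'smyth']))
# 'millar' and 'myler' share block M460; small blocks receive fake records
```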
Blocking with Reference Values: An alternative to adding fake records for improv-
ing the privacy of blocking is the use of reference values (as discussed in Sect. 3.4).
The reference values can be used by the database owners for clustering their database
records. Comparison can then be restricted to the clusters (blocks) of the same or
similar reference records. Such an approach has been proposed in [77] based on k
nearest neighbor (kNN) clustering. This approach first clusters the reference values
identically at each database owner such that each cluster contains at least k reference
values to ensure k-anonymity; clustering is based on the Dice-coefficient similarity
between values. In the next step, each database owner assigns its records to the nearest
cluster based on the Dice-coefficient between records and reference values. Finally
each database owner sends its clusters (encoded reference values and records) to the
linkage unit which then compares the records between corresponding clusters.
An alternate proposal utilizes a local sorted neighborhood clustering (SNC-3P)
for improved performance in the blocking phase while retaining the use of reference
values and support for k-anonymity [159]. Each database owner sorts a shared set
of reference values and then inserts its records into the sorted list according to their
sorting key. From the sorted list of reference values and records the initial Sorted
Neighborhood Clusters (SNCs) are determined such that each cluster contains one
reference value and a set of database records. To ensure k-anonymity, the initial clus-
ters are merged into larger blocks containing at least k database records. This differs
from kNN clustering where k reference records are needed per cluster. The merging
of the initial clusters can be based on similarity or size constraints. The remaining
protocol with sending the encoded records to the linkage unit for comparison works
as for kNN clustering.
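A rough Python sketch of the local clustering step of SNC-3P is given below. It makes several simplifying assumptions: records and reference values are ordered by a plain alphabetical sorting key, and initial clusters are merged by size only (the approach described above also supports similarity-based merging):

```python
def snc_clusters(reference_values, records, key, k=2):
    """One initial cluster per reference value; clusters are merged
    until each resulting block holds at least k database records."""
    refs = sorted(reference_values)
    clusters = {ref: [] for ref in refs}
    for rec in records:
        # assign each record to the closest preceding reference value
        owner = max((r for r in refs if r <= key(rec)), default=refs[0])
        clusters[owner].append(rec)
    merged, current = [], []
    for ref in refs:  # merge initial clusters to ensure k-anonymity
        current += clusters[ref]
        if len(current) >= k:
            merged.append(current)
            current = []
    if current:  # attach any remainder to the last block
        if merged:
            merged[-1].extend(current)
        else:
            merged.append(current)
    return merged

refs = ['adams', 'jones', 'smith']
recs = ['amelia', 'jeff', 'millar', 'peter', 'sieman', 'smyth']
print(snc_clusters(refs, recs, key=lambda r: r, k=2))
# [['amelia', 'jeff'], ['millar', 'peter', 'sieman', 'smyth']]
```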
An adaptation of SNC-3P for two parties without a linkage unit (SNC-2P)was
presented in [163]. In this approach, the two database owners generate their reference
values independently, so that they end up with two different sets of reference values.
As for SNC-3P, each database owner sorts its reference values, inserts its records
into the sorted list, builds initial SNCs (with one reference value and its associated
records), and merges these clusters to guarantee k-anonymity. Afterwards the data-
base owners exchange their reference values. These values are then merged with
the local reference values and sorted. To find candidate pairs between the sources a
sorted neighborhood method with a sliding window w is applied on these reference
values. The window size w determines the number of reference values originating
from each data source, e.g. for w = 2 the sliding window includes 2 reference values
from each data source. In the last step, the encoded records falling into a window are
exchanged for comparison.
LSH-based blocking: Locality-sensitive hashing (LSH) has been proposed to solve
the problem of nearest neighbor search in high dimensional spaces [61,69]. For
blocking, LSH uses a family of hash functions to generate keys used to partition the
records in a database, so that similar records are grouped into the same block [89].
Durham investigated the use of LSH for private blocking of records masked as Bloom
filters [46]. She considered two families of hash functions depending on the used dis-
tance function (Jaccard or Hamming distance). For the Jaccard distance, she proposed
the use of MinHash functions. A MinHash function hi permutes the bits of a Bloom
filter bi and selects from the permuted Bloom filter the first index position with a set
bit (1-bit). By applying φ MinHash functions we obtain φ index positions which are
concatenated to generate the final MinHash key. So for Bloom filter bi and the family
of hash functions H, we determine key(bi)_H = concat(h1(bi), h2(bi), ..., hφ(bi)),
where hj ∈ H with 1 ≤ j ≤ φ, and the function concat() concatenates a set of input
values. For the Hamming distance, Durham proposed the use of HLSH hash functions
that select the bit value of a Bloom filter at a random position ρ. In the same way as
MinHash, φ HLSH functions are applied on a Bloom filter bi and the values of the
φ selected bits are concatenated to obtain the final hash key of bi.
Example Consider two Bloom filters b1 = 1100100011 and b2 = 1100100111,
two permutations p1 = (10, 7, 1, 3, 6, 4, 2, 9, 5, 8) and p2 = (4, 3, 6, 7, 2, 1, 5,
10, 9, 8), and the MinHash family H1 with two functions h1 = Min(p1(·)) and
h2 = Min(p2(·)), where Min(·) returns the first position of a set bit in the input
bit vector, and pi(·) returns the input bit vector permuted using pi. The applica-
tion of h1 and h2 on b1 results in h1(b1) = Min(p1(b1)) = Min(1010001110) = 1
and h2(b1) = Min(p2(b1)) = Min(0000111110) = 5. Hence the key of b1 in H1
is key(b1)_H1 = concat(h1(b1), h2(b1)) = (1, 5). In the same way we determine the
key of b2, i.e. key(b2)_H1 = (1, 5). Hence, for MinHash family H1, records b1 and b2
are put into the same block and will be compared with each other.
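The following Python sketch reproduces this example; representing each permutation as a list of 1-based source positions is our own encoding choice:

```python
def minhash_key(bf, permutations):
    """Concatenate, for each permutation, the first 1-bit position
    of the permuted Bloom filter (positions are 1-based)."""
    key = []
    for perm in permutations:
        permuted = [bf[p - 1] for p in perm]
        key.append(permuted.index(1) + 1)  # assumes at least one 1-bit
    return tuple(key)

b1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]   # 1100100011
b2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]   # 1100100111
p1 = [10, 7, 1, 3, 6, 4, 2, 9, 5, 8]
p2 = [4, 3, 6, 7, 2, 1, 5, 10, 9, 8]
print(minhash_key(b1, [p1, p2]))  # (1, 5)
print(minhash_key(b2, [p1, p2]))  # (1, 5) -> same block as b1
```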
Both families, MinHash and HLSH, depend on two parameters: the number of
hash functions φ as well as the number of passes or iterations μ. Since the final hash
key of a record concatenates φ values, using a high φ leads to more homogeneous
blocks and better precision (i.e., blocks containing similar records with higher prob-
ability). However, a high φ also increases the probability of missing true matches
(reduced recall). This problem is addressed by applying μ iterations, each with a
different set of hash functions. Therefore each record bi will be hashed to several
blocks to allow identifying more true matches.
In [83] the authors present a theoretical analysis of the use of MinHash functions
to identify good values for the parameters φopt and μopt to efficiently achieve good
precision and recall. The naïve approach to improve recall is to increase the number
of iterations μ and thus the number of blocks to which records are assigned. The
drawbacks of this method are the high runtime caused by the computation of the
permutations, increased number of record pairs to compare, and the large space
needed to store intermediate results. This observation was experimentally confirmed
in [46]. The choice of φ_opt is also complex and depends on the expected running
time (for details see [83]).
Evaluation of Private Blocking Approaches: The relatively large number of pos-
sible blocking approaches requires detailed evaluations regarding their relative scal-
ability, blocking quality and privacy for different kinds of workloads. One of the few
studies in this respect has been presented by Vatsalan et al. [165]. For scalability
they considered runtime and the so-called reduction ratio (RR), a value indicating
the number of pruned candidate pairs compared to all possible record pairs (which
thus evaluates to what degree the search space is reduced). For blocking quality they
considered the recall and precision metrics pair completeness (PC) and pair quality
(PQ), respectively [29]. For privacy they estimated the so-called disclosure risk (DR)
measures, which represent the probability that masked records/QID values can be
linked with records or values in a publicly available dataset.
The evaluation of [165] considers six simulated blocking strategies including
kNN [77], SNC-3P [159], SNC-2P [163] and LSH blocking [46]. Regarding blocking
runtime, the SNC and LSH schemes performed best. All strategies except SNC-2P
achieved a very high RR of almost 1. On the other hand, SNC-2P achieved the best
PC. The best trade-off between RR and PC was observed for LSH. Considering the
privacy aspect, SNC-2P was found to have a low DR while kNN and LSH generally
expose the highest DR.
While this study provides interesting results, we see a need for additional bench-
mark studies given that further blocking schemes have been developed more recently
and that the relative behavior of each approach depends on numerous parameter set-
tings as well as characteristics of the chosen datasets.
4.2 Filtering Techniques
Almost all proposed PPRL schemes based on Bloom filters aim at identifying
the pairs of bit vectors with a similarity above a threshold. For regular record
linkage, such a threshold-based comparison of record pairs is known as a sim-
ilarity join [39]. The efficient processing of such similarity joins for different
kinds of similarity measures has been the focus of much research in the past, e.g.
[2,64,138,172]. Several approaches utilize the characteristics of the considered
similarity measure and the prespecified similarity threshold to reduce the search
space thereby speeding up the linkage process. This holds especially for the broad
class of token-based similarity joins where the comparison of records is based on the
set of tokens (e.g. q-grams) of QIDs. In this case, one can exclude all pairs of records
that do not share at least one token. Further proposed optimizations for such similar-
ity joins include the use of different kinds of filters (for example, length and prefix
filters) and dynamically created inverted indexes [10]. PPJoin [172] is an efficient
approach that includes these kinds of optimizations. Several filtering approaches
also utilize the characteristics of similarity functions for metric spaces to reduce the
search space, in particular the so-called triangle inequality (see below) [174].
For PPRL, similar filtering (similarity join) approaches are usable but need to be
adapted to the comparison of masked records such as Bloom filters and the associated
similarity measures. For Bloom filters it is relatively easy to apply the known token-
based similarity measures by considering the set bit positions (1-bits) in the bit vectors
as the “tokens”. This has already been shown in Sect. 3.5 for the Dice-coefficient
similarity, which is based on the degree of overlapping bit positions. This is also the
case for the related Jaccard similarity. For two bit vectors b1 and b2 it is defined as
follows:

Jacc_sim(b1, b2) = |b1 ∧ b2| / |b1 ∨ b2| = |b1 ∧ b2| / (|b1| + |b2| − |b1 ∧ b2|),   (2)

where |bi| denotes the number of set bits in bit vector bi, which is also called its
length or cardinality. For the example Bloom filter pair shown in Fig. 4, the Jaccard
similarity is 5/7 = 0.71. In the following, we outline several filtering approaches
that have been proposed for PPRL.
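As a minimal sketch (ours) of Eq. 2, the following Python snippet computes the Jaccard similarity of two Bloom filters represented as integers; the bit vectors are chosen so that, as in the Fig. 4 example, the pair has lengths 5 and 7, five common 1-bits, and a Jaccard similarity of 5/7 ≈ 0.71:

```python
# Jaccard similarity on Bloom filters (Eq. 2): the set bit positions
# play the role of the "tokens" of a token-based similarity measure.

def jaccard(b1: int, b2: int) -> float:
    common = bin(b1 & b2).count('1')                        # |b1 AND b2|
    union = bin(b1).count('1') + bin(b2).count('1') - common
    return common / union if union else 0.0

b1 = 0b1111100   # 5 set bits, all contained in b2
b2 = 0b1111111   # 7 set bits
print(round(jaccard(b1, b2), 2))   # 0.71
```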
Length Filter: The similarity function Jacc_sim (as well as Dice_sim) allows the
application of a simple length filter to reduce the search space. This is because the
minimal similarity (overlap of set bits) can only be achieved if the lengths (number of
set bits) of the two input records do not deviate too much. Formally, for two records
ri and rj with |ri| ≤ |rj|, it holds that

Jacc_sim(ri, rj) ≥ st ⇒ |ri| ≥ st · |rj|   (3)

For example, two records cannot satisfy a (Jaccard) similarity threshold st = 0.8 if
their lengths differ by more than 20%. Hence, for a similarity threshold of 0.8, the
length filter would already exclude the example Bloom filter pair shown in Fig. 4
without a detailed comparison, since Eq. 3 (5 ≥ 0.8 · 7 = 5.6) does not hold. The
two-party PPRL approach proposed by Vatsalan and Christen uses such a length
filter for the Dice-coefficient similarity [158].
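The following sketch (ours) applies Eq. 3 over a set of Bloom filters sorted by their number of set bits; once a longer record violates the bound, all even longer ones can be skipped. With st = 0.8 the length-5/length-7 pair from above would already be pruned, so the example uses st = 0.7:

```python
# Length filter (Eq. 3): only pairs whose set-bit counts do not deviate
# too much survive as candidates for a detailed comparison.

def length_filter_pairs(blooms, s_t):
    """Yield candidate pairs that satisfy the Jaccard length filter."""
    by_length = sorted((bin(b).count('1'), b) for b in blooms)
    for i, (len_i, b_i) in enumerate(by_length):
        for len_j, b_j in by_length[i + 1:]:
            if len_i < s_t * len_j:  # Eq. 3 violated; longer records only get worse
                break
            yield b_i, b_j

blooms = [0b1111100, 0b1111111, 0b1000000]
print(list(length_filter_pairs(blooms, 0.7)))   # only the (5, 7) pair survives
```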
PPJoin for PPRL: The privacy-preserving version of PPJoin (called P4Join) utilizes
three filters to reduce the search space: the length filter as well as a prefix filter and
a position filter [150]. The prefix filter is based on the fact that matching bit vectors
need a high degree of overlap in their set bit positions in order to satisfy a predefined
threshold. Pairs of records can thus be excluded from comparison if they have an
insufficient overlap. This overlap test can be limited to a relatively small sub-set
of bit positions, e.g. in the beginning (or prefix) of the vectors. To maximize this
filter idea, P4Join applies a pre-processing to count for each bit position the number
of records where this bit is set to 1 and reorders the positions of all bit vectors
in ascending order of these frequency counts. This way the prefixes of bit vectors
contain infrequently set bit positions reducing the likelihood of an overlap with other
bit vectors. The position filter of P4Join can avoid the comparison of two records
even if their prefixes overlap depending on the prefix positions where the overlap
occurs. For more details of this filter we refer to [150].
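The frequency-based reordering at the heart of the prefix filter can be sketched as follows (our illustration; P4Join itself combines the reordered vectors with the length and position filters [150]):

```python
# Reorder all bit positions in ascending order of their 1-frequency, so
# that rarely set positions move to the front (the prefix) of every bit
# vector, which makes prefix overlaps between dissimilar records unlikely.

def reorder_by_frequency(blooms):
    num_pos = len(blooms[0])
    counts = [sum(b[pos] for b in blooms) for pos in range(num_pos)]
    order = sorted(range(num_pos), key=counts.__getitem__)
    return [[b[pos] for pos in order] for b in blooms]

blooms = [[1, 0, 1, 1],
          [1, 1, 0, 1],
          [1, 0, 0, 1]]
print(reorder_by_frequency(blooms))  # infrequently set positions come first
```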
As we will see in the comparative evaluation below, the filtering approaches
achieve only a relatively small improvement for PPRL since the filter tests already
imply a certain overhead that is not much smaller than that of the match tests (which
are relatively cheap for Bloom filters). In addition, Bloom filter masking for PPRL
should ideally set about 50% of the bits to 1 in order to make the Bloom filters less
vulnerable to frequency attacks [117], making P4Join less effective.
Multi-bit Trees: The use of multi-bit trees was originally proposed for fast similarity
search in large databases of chemical fingerprints (masked into Bloom filters) [98].
A query Bloom filter bq is searched for in a database to retrieve all elements
whose similarity with bq is above the threshold st. A multi-bit tree is a binary tree that
iteratively assigns fingerprints to its nodes based on so-called match bits. A match
bit refers to a specific position of the bit vector and can be 1 or 0; it indicates that all
fingerprints in the associated subtree share the specified match bit. When building
up the multi-bit tree, one or several match bits are selected in each step so that the
set of unassigned fingerprints is split roughly in half. The splitting continues as long
as the number of fingerprints per node does not fall below a limit ([98] recommends
a limit of 6). For a query fingerprint, the match bits can then be used to determine
the maximal possible similarity of subtrees when traversing the tree, thereby
eliminating many fingerprints from the comparison.
As suggested in [5], multi-bit trees can easily be applied for PPRL using Bloom
filters and Jaccard similarity. For two datasets, the larger input dataset is used to build
the multi-bit trees while each record (fingerprint) of the second dataset is used for
searching similar records. The multi-bit approach of [5] partitions the fingerprints
according to their lengths such that all fingerprints with the same length belong to the
same partition (or bucket). To apply the length filter, we can then restrict the search
for similar fingerprints to the partitions meeting the length criterion of Eq. 3. Query
efficiency is further improved by organizing all fingerprints of a partition within a
multi-bit tree.
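A sketch (ours) of this length-partitioned search, without the multi-bit trees inside the partitions, may look as follows; Eq. 3 bounds the admissible partition lengths for a query:

```python
import math
from collections import defaultdict

def build_partitions(fingerprints):
    """Bucket fingerprints by their number of set bits (their 'length')."""
    parts = defaultdict(list)
    for f in fingerprints:
        parts[bin(f).count('1')].append(f)
    return parts

def jaccard(b1, b2):
    union = bin(b1 | b2).count('1')
    return bin(b1 & b2).count('1') / union if union else 0.0

def search(parts, query, s_t):
    """All fingerprints with Jaccard similarity >= s_t to the query;
    only partitions satisfying the length filter of Eq. 3 are probed."""
    q_len = bin(query).count('1')
    lo, hi = math.ceil(s_t * q_len), math.floor(q_len / s_t)
    return [f for length in range(lo, hi + 1)
              for f in parts.get(length, [])
              if jaccard(query, f) >= s_t]

parts = build_partitions([0b1111100, 0b1111111, 0b1000000])
print(search(parts, 0b1111110, 0.7))   # the length-1 partition is never probed
```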
In evaluations of [5,145] the multi-bit tree approach was found to be very effective,
performing better than or similarly to blocking approaches such as canopy clustering
and sorted neighborhood.
PPRL for Metric Space Similarity Measures: A metric space consists of a set of
data objects and a metric to compute the distance between the objects. The main
property of interest that a metric or distance function d for metric spaces has to
satisfy is the so-called triangle inequality. It requires that for any objects x, y and z
it holds that

d(x, z) ≤ d(x, y) + d(y, z)   (4)
Fig. 5 Triangle inequality: object y cannot lie within the search radius rad(q) of query
object q since the difference between d(p, q) and d(p, y) exceeds rad(q) (taken from [149])
Distance functions for metric spaces satisfying this property include the
Minkowski distances (for example, Euclidean distance), edit distance, Hamming
distance and the Jaccard-coefficient (but not the Dice-coefficient) [174]. The triangle
inequality has been used for private comparison and classification in PPRL using
reference values [124,162].
The triangle inequality has also been used to reduce the search space for similarity
search and record linkage [7,12]. In both cases we have to find for a query object
q those similar objects x with a distance d(q, x) lower than or equal to a maximal
distance threshold (or above a minimal similarity threshold), which can be seen as a
radius rad(q) around q in Fig. 5. The triangle inequality allows one to avoid computing
the distance between two objects based on their precomputed distances to a third
reference object or pivot, such as object p in Fig. 5. Utilizing the precomputed
distances d(p, q) and d(p, x), we only have to compute the distance d(q, x) for objects
x that satisfy |d(p, q) − d(p, x)| ≤ rad(q). In all other cases the comparison can be
avoided, such as for object y in Fig. 5.
Several alternatives to utilize the triangle inequality to reduce the search space for
PPRL have been studied in [149], in particular for the Hamming distance, which has
been shown to be equivalent to the Jaccard similarity [172]. The best performance
was achieved for a pivot-based approach that selects a certain number of data objects
from a sample of the first dataset as pivots and assigns every other object of the
first dataset to its closest pivot. For each pivot, the maximal distance (radius) of
its objects is also recorded. Pivots are iteratively determined from the sample set of
objects such that the object with the greatest distance to all previously determined
pivots becomes the next pivot. The rationale behind this selection strategy is to have
a relatively large distance between pivots so that searching for similar objects can be
restricted to the objects of relatively few pivots. Determining the pivots from a sample
rather than from all objects limits the overhead of pivot selection. The search for
similar (matching) objects can be restricted to the pivots whose radius can possibly
overlap with the radius of the query objects. For the objects of the relevant pivots, the
triangle inequality is further used to prune objects from the comparison.
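The following Python sketch (ours) illustrates this pivot-based pruning under simplifying assumptions: the Hamming distance serves as the metric, pivots are selected greedily from a sample, and both the per-cluster and the per-object pruning rules follow the description above:

```python
def dist(a, b):
    return bin(a ^ b).count('1')   # Hamming distance is a metric

def select_pivots(sample, k):
    """Greedily pick k pivots with large distances to each other."""
    pivots = [sample[0]]
    while len(pivots) < k:
        pivots.append(max(sample, key=lambda o: min(dist(o, p) for p in pivots)))
    return pivots

def assign(objects, pivots):
    """Assign each object to its closest pivot, keeping d(p, o)."""
    clusters = {p: [] for p in pivots}
    for o in objects:
        p = min(pivots, key=lambda piv: dist(o, piv))
        clusters[p].append((o, dist(o, p)))
    return clusters

def query(clusters, q, radius):
    """All objects within 'radius' of q, pruned via the triangle inequality."""
    result = []
    for p, members in clusters.items():
        d_pq = dist(p, q)
        if d_pq - max((d for _, d in members), default=0) > radius:
            continue   # the whole cluster lies outside the search radius
        for o, d_po in members:
            # |d(p,q) - d(p,o)| > radius implies d(q,o) > radius
            if abs(d_pq - d_po) <= radius and dist(q, o) <= radius:
                result.append(o)
    return result

objects = [0b101100, 0b101110, 0b010001, 0b111111]
clusters = assign(objects, select_pivots(objects, 2))
print(query(clusters, 0b101100, 1))   # the query object and its 1-bit neighbor
```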
Comparative Evaluation: The performance of pivot-based approaches for metric
similarity measures has been evaluated in [149] and compared with the use of P4Join
and multi-bit trees. The evaluation has been done for synthetically generated datasets
Table 3 PPRL runtime in minutes for different dataset sizes and filtering approaches (taken
from [149])

Algorithm               100,000   200,000   300,000   400,000   500,000
NestedLoop                  3.8      20.8      52.1      96.8     152.6
Multi-bit Tree              2.6      11.3      26.5      50.0      75.9
P4Join                      1.4       7.4      24.1      52.3      87.9
Pivots (Metric Space)       0.2       0.4       0.9       1.3       1.7
of 100,000–500,000 records such that 80% of the records are in the first dataset and
20% in the second. Bloom filters of length 1,000 bits are used to mask the QIDs of
records, and the comparison is based on a Jaccard similarity threshold of 0.8 or the
corresponding Hamming distance for the metric-space approach.
Table 3 summarizes the runtimes of the different approaches as well as for a naïve
nested loop approach without any filtering (all implemented using Java) on a standard
PC (Intel i7-4770, 3.4 GHz CPU with 16 GB main memory). The results show that
both multi-bit trees and P4Join perform similarly but achieve only modest improve-
ments (less than a factor of 2) compared to the naïve nested loop scheme. By contrast
the pivot-based metric space approach achieves order-of-magnitude improvements.
For the largest dataset it only needs 1.7 min and is 40 times faster than using multi-bit
trees. A general observation for all approaches is that the runtimes increase more than
linearly (almost quadratically) with the size of datasets, indicating a potential scala-
bility problem despite the reduction of the search space. Substantially larger datasets
would thus create performance problems even for the best filtering approach, indicating
that additional runtime improvements are necessary, e.g. by the use of parallel
PPRL.
4.3 Parallel PPRL
PPRL in Big Data applications involves the comparison of a large number of masked
records as the main part of the overall execution pipeline. Parallel linkage on multiple
processors aims at improving the execution time proportionally to the number of
processors [42,90,91]. This can be achieved by partitioning the set of all record
pairs to be compared, and conducting the comparison of the different partitions in
parallel on different processors. A special case would be to utilize a blocking approach
to compare the records in different blocks in parallel. In the following we discuss
two approaches for parallel PPRL that have been proposed: one utilizes graphics
processors or GPUs for parallel processing within a single machine, and the other
one is based on Hadoop and its MapReduce framework. Both approaches have also
been used for general record linkage.
Parallel PPRL with GPUs: The utilization of Graphical Processing Units (GPUs) to
speed-up similarity computations is a comparatively new approach [56,120]. Modern
GPUs provide thousands of cores that allow for a massively-parallel application of
the same instruction set to disjoint data partitions. The availability of frameworks like
OpenCL and CUDA simplify the utilization of GPUs to parallelize general purpose
algorithms. The GPU programs (called kernels) are typically written in a dialect of
the C programming language. Kernel execution requires the input and output
data to be transferred between the main memory of the host system and the memory
of the GPU, and it is important to minimize the amount of data to be transferred.
Further limitations are that there is no dynamic memory allocation on GPUs (all
resources required by a program need to be allocated a priori) and that only basic
data types (e.g., int, long, float) and fixed-length data structures (e.g., arrays) can be
used.
Despite such limitations, the utilization of GPUs is a promising approach to speed-
up PPRL. This is especially the case for Bloom filter masking where all records are
represented as bit vectors of equal length. These vectors can easily be stored in array
data structures on the GPU. Furthermore, similarity computations can be broken
down into simple bit operations which are easily processed by GPUs.
A GPU-based implementation for PPRL using the P4Join filtering is described
in [150]. It sorts the bit vectors of the two input datasets initially according to their
number of set bits (1-bits) and partitions the set of bit vectors into equal-sized blocks
such that multiple of such blocks fit into the GPU memory. Pairs of blocks are then
continuously loaded into the GPU for parallel comparison. To limit unnecessary data
transfers, the length filter (described in Sect. 4.2) is applied to avoid transferring pairs
of blocks that do not meet the length filter restriction. The kernel programs also apply
the prefix filter to save comparisons.
The evaluation in [150] showed that the GPU implementation is highly efficient
and improves runtimes by a factor of 20, even for a low-profile graphics card (Nvidia
GeForce GT 540M with 96 CUDA cores @ 672 MHz, 1 GB memory). It would be
interesting to realize GPU versions of other PPRL approaches and to utilize more
powerful graphics cards with thousands of cores for improved performance.
Hadoop-based Parallel PPRL: Many Big Data applications are based on local
Shared Nothing clusters driven by software from the open-source Hadoop ecosystem
for parallel processing. Depending on the data volume and the needed degree of
parallelism, up to thousands of multi-processor nodes are utilized. A main reason for
the success of Hadoop is that its programming frameworks, in particular MapReduce
and newer platforms such as Apache Spark (http://spark.apache.org) or Apache Flink
(https://flink.apache.org/), make it easy to develop programs that can be automatically
executed in parallel on Hadoop clusters.
Several approaches and implementations have utilized MapReduce for parallel
record linkage [93,166]. In its simplest form, the Map tasks read the input data in
parallel and apply a blocking key to assign each record to a block. Then the data
records are dynamically redistributed among the Reduce tasks such that all records
Fig. 6 Parallel PPRL with MapReduce using LSH blocking [82], where the MinHash keys for
Bloom filters are computed in the Map phase and records with the same MinHash signature will be
sent to the same Reduce task for matching
with the same blocking key are sent to the same Reduce task. Comparison is then
performed block-wise and in parallel by the Reduce tasks. For highly skewed block
sizes this simple approach can result in load balancing problems for the Reduce tasks;
approaches to solve this data skew or load balancing problem are proposed in [94].
The sketched approach can in principle also be applied for parallelizing PPRL,
e.g., if the linkage unit utilizes a Hadoop cluster. One such approach for using MapReduce
to speed up PPRL has been proposed in [82]. The authors apply an LSH-based
blocking method using the MinHash approach (see Sect. 4.1). The use of MapReduce
is rather straightforward and illustrated in Fig. 6. The Bloom filters of both sources
are initially stored in the distributed file system (HDFS) as chunks. In the Map phase,
records are read sequentially and for each Bloom filter r a set of MinHash keys is
computed. The records are then redistributed so that records with the same key are
sent to the same Reduce task for comparison. The main drawback of this strategy is
that records may be compared several times by different Reduce tasks because they
could share many keys (as shown in Fig. 6 for records r2 and s1). To overcome this
problem the authors proposed another strategy that chains two MapReduce jobs,
where the first one is similar to the described method except that the Reduce phase
only outputs the pairs of record identifiers instead of comparing the records. In the
second MapReduce job, duplicate record pairs are grouped at the same Reducer to
be compared only once. In this process, the Bloom filters themselves are not
redistributed (only their identifiers); instead, the Bloom filters are stored in a relational
database from where they are read when needed. The evaluation of this parallel LSH
approach in [82] was
limited to only 2 and 4 nodes and small datasets (about 300,000 records) so that the
overall scalability of the approach remains unclear.
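The two-job strategy can be simulated with plain Python data structures to show the data flow (a sketch under our own naming, not the implementation of [82]; lsh_keys stands for any MinHash key generator such as the one in Sect. 4.1):

```python
from collections import defaultdict
from itertools import combinations

def job1_candidate_pairs(records, lsh_keys):
    """records: {record_id: bloom}. Map: compute the LSH keys per record;
    Reduce: emit the id pairs of each block (no comparisons yet)."""
    blocks = defaultdict(list)
    for rid, bloom in records.items():
        for key in lsh_keys(bloom):
            blocks[key].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

def job2_unique_pairs(pairs):
    """Group duplicate pairs so that each pair is compared exactly once."""
    return set(pairs)

records = {'r1': '1100', 'r2': '1100', 's1': '1101'}
toy_keys = lambda b: [b[:2], b[2:]]   # toy stand-in for MinHash keys
print(job2_unique_pairs(job1_candidate_pairs(records, toy_keys)))
```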
For future work, it would be valuable to investigate and compare different parallel
PPRL approaches utilizing the Hadoop ecosystem. Such approaches could also utilize
the Spark or Flink frameworks, which support more operators than just Map and
Reduce as well as efficient distributed in-memory processing.
5 Multi-party PPRL
While there have been many different approaches proposed for PPRL [164], most
work thus far has concentrated on linking records from only two databases (or par-
ties). Only some approaches have investigated linking records from three or more
databases [75,104,118,122,127,130,131,160], with most of these being limited to
exact matching or matching of categorical data only, as will be discussed below. How-
ever, as the example applications described in Sect. 1 have shown, linking data from
multiple databases is increasingly being required for several Big Data applications.
In the following, we describe existing techniques of multi-party private blocking and
private comparison and classification for multi-party PPRL (MP-PPRL).
5.1 Multi-party Private Blocking Techniques
Private blocking for MP-PPRL is crucial due to the exponential growth of the com-
parison space with the number of databases linked. However, this has not been studied
until recently, which has made MP-PPRL impractical for real applications.
Tree-based approaches: The first approach [130] is based on a single-bit tree
(adapted from multi-bit tree [98]) data structure, which is constructed iteratively
to arrange records (masked into Bloom filters) such that similar records are placed
into the same tree leaf while non-similar records are placed into different leaf nodes
in the tree. At each iteration, the set of Bloom filters in a tree node is recursively
split based on selected (according to a privacy criteria) bit positions which are agreed
upon by all parties. A drawback with this approach, however, is that it might miss
true matches due to the recursive splitting of Bloom filters. Furthermore, a commu-
nication step is required among all parties for each iteration.
This limitation of missing true matches in the single-bit tree-based approach [130]
has been addressed in [131] using a multi-bit tree [98] data structure (as we discussed
in Sect. 4.2) that is combined with canopy clustering. Multi-bit tree-based filtering for
PPRL of two databases was first introduced by Schnell [144]. In [131] the concept of
multi-bit trees was used to split the databases (masked into Bloom filters) individually
by the parties into small mini-blocks, which are then merged into larger blocks
according to privacy and computational requirements based on their similarity using
a canopy clustering technique [40].

Fig. 7 Multi-party private blocking approach as proposed by Ranbaduge et al. [133] (adapted
from [133]). Candidate block sets from all three parties (CA1, CB1, CC1) and a sub-set of two
parties (CA2, CB2) are identified to be compared and classified
Linkage unit-based approaches: A communication-efficient approach for
multi-party private blocking by using a linkage unit was recently proposed [133],
as illustrated in Fig. 7. In the first step of this approach, local blocks are generated
individually by each party using a private blocking technique (which is considered
to be a black box). For example, the private blocking approach based on multi-bit
tree and canopy clustering [131] (described above) can be used for local blocking.
A block representative in the form of a MinHash signature [18] is then generated
for each block and sent to a linkage unit. The linkage unit applies global blocking
using locality sensitive hashing (LSH) to identify the candidate block sets from all
parties or from sub-sets of parties based on the similarity between block representatives.
tives. Local blocking provides the database owners with more flexibility and control
over their blocks while eliminating all communications among them. This approach
outperforms existing multi-party private blocking approaches in terms of scalability,
privacy, and blocking quality, as validated by a set of experiments conducted by the
authors [133].
Karapiperis and Verykios recently proposed a multi-party private blocking
approach based on LSH [84]. This approach uses L independent hash tables (or
blocking groups), each of which consists of key-bucket pairs where keys represent
the blocking keys and buckets host a linked list aimed at grouping similar records that
were previously masked into Bloom filters. Each hash table is assigned a set of K
hash functions, which is generated by a linkage unit and sent to all the database owners
to populate their sets of blocks accordingly. The same authors extended this approach
by proposing a frequent pairs scheme (FPS) [85] for further reducing the number of
comparisons while maintaining a high level of recall. This approach achieves high
blocking quality by identifying similar record pairs that exhibit a number of LSH
collisions above a given threshold, and then performs distance calculations only for
those similar pairs. Empirical results showed significant improvement in running
time due to a drastic reduction of candidate pairs by the FPS, while achieving high
blocking quality [85].
A major drawback of these multi-party private blocking techniques is that they
still result in an exponential comparison space with an increasing number of databases
to be linked, especially when the databases are large. Therefore, efficient
communication patterns, such as ring-based or tree-based [113,142], as well as
advanced filtering techniques, such as those discussed in Sect. 4.2, need to be
investigated for multi-party PPRL in order to make PPRL scalable and viable in Big Data
applications.
5.2 Multi-party Private Comparison and Classification
Techniques
Several private comparison and classification techniques for MP-PPRL have been
developed in the literature. However, they fall short of providing a practical solution,
either because they allow exact matching only or because they are computationally
not feasible for the size and number of multiple databases. In the following we describe these
approaches and their drawbacks.
Secure Multi-party Computation (SMC)-based approach: An approach based
on SMC using an oblivious transfer protocol was proposed in [122] for multi-party
private comparison and classification. While provably secure, the approach only
performs exact matching of masked records and it is computationally expensive
compared to efficient perturbation-based privacy techniques such as Bloom filters
and k-anonymity [164].
Generalization-based approaches: A multi-party private comparison and classifi-
cation approach was introduced in [75] to perform secure equi-join of masked records
from multiple k-anonymous databases by using a linkage unit. The database records
are k-anonymised by the database owners and sent to a linkage unit. The linkage
unit then compares and classifies records by applying secure equi-join, which allows
exact matching only.
Another multi-party private comparison and classification approach based on k-
anonymity for categorical values was proposed in [118]. In this approach, a top-
down generalization is performed on the QIDs to provide k-anonymous privacy (as
discussed in Sect. 3.4) and the generalized blocks are then classified into matches
and non-matches using the C4.5 decision tree classifier.
Probabilistic data structure-based approaches: An efficient multi-party private
comparison and classification approach for exact matching of masked records using
Bloom filters was introduced by Lai et al. [104], as illustrated in Fig. 8. Each party
hash-maps their record values into a single Bloom filter and then partitions its Bloom
filter into segments according to the number of parties involved in the linkage. The
segments are exchanged among the parties such that each party receives a corre-
sponding Bloom filter segment from all other parties. The segments received by a
party are combined using a conjunction (logical AND) operation. The resulting con-
juncted Bloom filter segments are then exchanged among the parties to generate the
full conjuncted Bloom filter. Each party compares its Bloom filter of each record with
the final conjuncted Bloom filter. If the membership test of a record’s Bloom filter is
successful then the record is considered to be a match across all databases. Though
the computation cost of this approach is low since the computation is completely
distributed among the parties without a linkage unit and the creation and processing
of Bloom filters are very fast, the approach can only perform exact matching.
Fig. 8 Bloom filter masking-based exact matching approach for MP-PPRL as proposed by Lai et
al. [104] (adapted from [164]): the QID values of each party are hash-mapped into a party-wise
Bloom filter, the Bloom filter segments are combined by a distributed conjunction (logical AND),
and each party then tests its own values against the final conjuncted Bloom filter for exact matches
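The core of this protocol can be illustrated with a small local simulation (ours): all segment conjunctions are computed in one place rather than being distributed among the parties, and the hash functions and Bloom filter size are arbitrary illustrative choices:

```python
import hashlib

BITS = 64   # illustrative Bloom filter length

def positions(value, num_hashes=3):
    """Bit positions that a QID value sets in the Bloom filter."""
    return [int(hashlib.sha256(f'{i}{value}'.encode()).hexdigest(), 16) % BITS
            for i in range(num_hashes)]

def bloom(values):
    bf = 0
    for v in values:
        for pos in positions(v):
            bf |= 1 << pos
    return bf

parties = [{'peter', 'miller', 'robert'}, {'peter', 'anna'}, {'peter', 'robert'}]
conjunction = bloom(parties[0]) & bloom(parties[1]) & bloom(parties[2])

for value in sorted(parties[0]):   # membership test per record
    is_match = all(conjunction >> pos & 1 for pos in positions(value))
    print(value, is_match)   # only 'peter' occurs in all three databases
```

As with any Bloom filter membership test, false positives are possible; the comment on 'peter' assumes no hash collisions for this toy data.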
Fig. 9 Bloom filter masking-based approximate matching approach for MP-PPRL proposed by
Vatsalan and Christen [160] (adapted from [160]): in the illustrated example, the segment-wise
conjunctions yield c1 = 1, c2 = 2, and c3 = 1 common 1-bits, the three Bloom filters contain
x1 = 6, x2 = 6, and x3 = 5 set bits, and thus Dice_sim = 3(1+2+1)/(6+6+5) = 0.706
Another efficient multi-party approach for private comparison and classification of
categorical data was recently proposed [80] using a Count-Min sketch data structure
(as described in Sect. 3.4). Sketches are used to summarize records individually by
each database owner, followed by a secure intersection of these sketches to provide a
global synopsis that contains the common records across parties and their frequencies.
The approach uses homomorphic operations, secure summation, and symmetric noise
addition privacy techniques.
Developing privacy-preserving approximate string comparison functions for mul-
tiple (more than two) values has only recently been considered [160]. This MP-
PPRL approach adapts Lai et al.’s Bloom filter-based exact matching approach [104]
(as described above) for approximate matching to distributively calculate the Dice-
coefficient similarity of a set of Bloom filters from different parties using a secure
summation protocol. This approach is illustrated in Fig. 9. The Dice-coefficient of P
Bloom filters (b1,...,bP) is calculated as:
Dice_sim(b1, ..., bP) = (P × c) / x = (P × (c1 + ... + cP)) / (x1 + ... + xP),   (5)

where ci is the number of common bit positions that are set to 1 in the ith Bloom filter
segment of all P parties, such that c = c1 + ... + cP, and xi is the number of bit positions
set to 1 (1-bits) in bi, such that x = x1 + ... + xP, with 1 ≤ i ≤ P.
Similar to Lai et al.’s approach [104], the Bloom filters are split into segments
such that each party receives a certain segment of the Bloom filters from all other
parties. A logical conjunction is applied to calculate ci individually by each party Pi
(with 1 ≤ i ≤ P); these values are then summed to calculate c using a secure summation
protocol. A secure summation of the xi is also performed to calculate x. These two sums
are then used to calculate the Dice-coefficient similarity of the Bloom filters using
Eq. 5. A limitation of this approach is that it can only be used to link a small number
of databases due to its large number of logical conjunction calculations (even when
a private blocking technique is used).
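A local simulation (ours) of this computation is sketched below; the plain Python sums stand in for the secure summation protocol, and the bit strings are chosen to reproduce the example of Fig. 9 (c1 = 1, c2 = 2, c3 = 1 and x1 = 6, x2 = 6, x3 = 5, giving Dice_sim = 12/17 ≈ 0.706):

```python
def segments(bits, parts):
    """Split a bit string into 'parts' roughly equal segments."""
    step = -(-len(bits) // parts)   # ceiling division
    return [bits[i:i + step] for i in range(0, len(bits), step)]

def dice_sim(blooms):
    p = len(blooms)
    segs = [segments(b, p) for b in blooms]
    # party i holds segment i of every Bloom filter and computes c_i
    c = [sum(all(segs[party][i][j] == '1' for party in range(p))
             for j in range(len(segs[0][i])))
         for i in range(p)]
    x = [b.count('1') for b in blooms]   # x_i; secure-summed in the protocol
    return p * sum(c) / sum(x)           # Eq. 5

b1 = '110001101010000'
b2 = '101001100110000'
b3 = '100101100010000'
print(round(dice_sim([b1, b2, b3]), 3))   # 0.706
```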
Therefore, more work needs to be done on multi-party private comparison and
classification to enable efficient and effective PPRL on multiple large databases, including
sub-set matching (i.e. identifying matching records across sub-sets of parties).
6 Open Challenges
In this section we first describe the various open challenges of PPRL, and then discuss
these challenges in the context of the four V's (volume, variety, velocity, and veracity)
of Big Data.
6.1 Improving Scalability
The growth of Big Data dispersed across multiple sources challenges PPRL in terms
of complexity (volume), which increases exponentially with multiple large databases.
Much research in recent years has focused on improving the scalability of the PPRL
process, both with regard to the sizes of the databases to be linked, as well as with
the number of databases to be linked. While significant progress has been made in
both these directions, further efforts are required to make all aspects of the PPRL
process scalable. Both directions are highly relevant for Big Data applications.
Even small blocks can still lead to a large number of record pair (or set) com-
parisons that are required in the comparison step, especially when databases from
multiple (more than two) sources are to be linked. For each set of blocks across several
parties, potentially all combinations of record sets need to be compared. For a
block that contains B records from each of P parties, B^P comparisons are required.
Crucial are efficient adaptive comparison techniques that stop the comparison of
records across parties once a pair of records has been classified as a non-match
between two parties. For example, assume the record set (rA, rB, rC, rD), where rA
is from party A, rB is from party B, and so on. Once the pair rA and rB has been
compared and classified as a non-match, there is no need to compare all other possible
record pairs (rA with rC, rA with rD, rB with rC, and so on) if the aim of the linkage
is to identify sets of records that match across all parties involved in a PPRL.
A very challenging aspect is the task of identifying sub-sets of records that match
across only a sub-set of parties. An example is to find all patients that have medical
records in the databases of any three out of a group of five hospitals. In this situa-
tion, all potential sub-sets of records need to be compared and classified. This is a
challenging problem with regard to the number of comparisons required and has not
been studied in the literature so far.
6.2 Improving Linkage Quality
The veracity and variety aspects (errors and variations) of Big Data need to be
addressed in PPRL by developing accurate and effective comparison and classifica-
tion techniques for high linkage quality. How to efficiently calculate the similarity of
more than two values using approximate comparison functions in PPRL is an impor-
tant challenge with multi-source linking. Most existing PPRL solutions for multiple
parties only support exact matching [80,104] or they are applicable to QIDs of
only categorical data [75,118]. Thus far only one recent approach supports approx-
imate matching of string data for PPRL on multiple databases [160] (as described in
Sect. 5.2).
In the area of non-PPRL, advanced collective [13] and graph-based [58,74]
classification techniques have been developed in recent times. These techniques are
able to achieve high linkage quality compared to the basic pair-wise comparison
and threshold-based classification approach that is often employed in most PPRL
techniques. Group linkage [123] is the only advanced classification technique that
has so far been considered for PPRL [105].
For classification techniques that require training data (i.e. supervised classifiers),
a major challenge in PPRL is how such training data can be generated. Because of
privacy and confidentiality concerns, in PPRL it is generally not possible to gain
access to the actual sensitive QID values (to decide if they refer to a true match or
a true non-match). The advantage of certain collective and graph-based approaches
[13,74] is that they are unsupervised and therefore do not require training data.
However, their disadvantage is their high computational complexities (quadratic or
even higher) [137]. Investigating and adapting advanced classification techniques
for PPRL will be a crucial step towards making PPRL useful for practical Big Data
applications, where training data are commonly not available, or are expensive to
generate.
6.3 Dynamic Data and Real-Time Matching
All PPRL techniques developed so far, in line with most non-PPRL techniques, only
consider the batch linkage of static databases. However, a major aspect of Big Data
is the dynamic nature of data (velocity) that requires adaptive systems to link data
as they arrive at an organization, ideally in (near) real-time. Limited work has so
far investigated temporal data [33,107] and real-time [32,70,129] matching in
the context of record linkage. Temporal aspects can be considered by adapting the
similarities between records depending upon the time difference between them, while
real-time matching can be achieved using sophisticated adaptive indexing techniques.
Some work has been done on dynamic privacy-preserving data publishing on
the cloud by developing efficient and adaptive QID index-based approaches over
incremental datasets [175,176].
Linking dynamic databases in a PPRL context opens various challenging research
questions. Existing masking (encoding) methods used in PPRL assume static data-
bases that allow parameter settings to be calculated a-priori leading to secure masking
of QID values. For example, Bloom filters should on average have 50% of their bits
set to 1, making frequency attacks more difficult [117]. Such masking might not
stay secure as the characteristics of the data change over time. Dynamic databases
also require novel comparison functions that can adapt to changing data as well as
adaptive masking techniques.
6.4 Improving Security and Privacy
In addition to the four V’s of Big Data, another challenging aspect that needs to
be considered for Big Data applications is security and privacy. As we discussed
in Sect. 3.2, most work in PPRL assumes the honest-but-curious (HBC) adversary
model [65,111]. Most PPRL protocols also assume that the parties do not collude
with each other (i.e. a sub-set of two or more parties do not collaborate with the
aim to learn sensitive information of another party) [111]. However, in a commercial
environment and in PPRL scenarios where many parties are involved, such as is
likely in Big Data applications, collusion is a real possibility that needs to be pre-
vented. Only few PPRL techniques consider the malicious adversary model [164].
The techniques developed based on this security model commonly have high compu-
tational complexities and are therefore currently not practical for the linkage of large
databases. Therefore, because the HBC model might not be strong enough while
the malicious model is computationally too expensive, novel security models that
lie between those two need to be investigated for PPRL. Two of these are the covert
adversary model [4] and accountable computing [71], which have been discussed in
Sect. 3.2. Research directions are required to develop new protocols that are practical
and at the same time more secure than protocols based on the HBC model.
With regard to privacy, most PPRL techniques are known to leak some information
during the exchange of data between the parties (such as the number and sizes of
blocks, or the similarities between compared records). How sensitive such revealed
information is for a certain dataset heavily depends upon the parameter settings used
by a protocol. Sophisticated attack methods [101] have been developed that exploit
the subtle pieces of information revealed by certain PPRL protocols to iteratively
gather information about sensitive values. Therefore, there is a need to harden existing
PPRL techniques to ensure they are not vulnerable to such attacks. Preserving privacy
of individual entities is more challenging with multi-party PPRL due to the increasing
risk of collusion between a sub-set of parties which aim to learn about another (sub-set
of) party’s private data. Distributing computations among pairs or groups of parties
can reduce the likelihood of collusion between parties if individual pairs or groups
can use different secret keys (known only to them) for masking their values.
Most PPRL techniques have mainly been focusing on the privacy of the individual
records that are to be linked [165]. However, besides individual record privacy, the
privacy of a group of individuals also needs to be considered. Often the outcomes of
a PPRL project are sets of linked records that represent people with certain charac-
teristics (such as certain illnesses, or particular financial circumstances). While the
names, addresses and other personal details of these people are not revealed during or
after the PPRL process, their overall characteristics as a group could potentially lead
to the discrimination of individuals in this group if these characteristics are being
revealed. The research areas of privacy-preserving data publishing [59] and statistical
confidentiality [45] have been addressing these issues from different directions.
PPRL is only one component in the management and analysis of sensitive, person-
related information by linking different datasets in a privacy-preserving manner.
However, achieving an effective overall privacy preservation needs a comprehensive
strategy regarding the whole data life cycle including collection, management, pub-
lishing, exchange and analysis of data to be protected (‘privacy-by-design’) [22].
Hence, it is necessary to better understand the role of PPRL in the life cycle for
sensitive data to ensure that it can be applied and that the match results are both
useful and privacy-preserving.
In research, the different technical aspects to preserve privacy have partially been
addressed by different communities with little interaction. For example, there is a
large body of research on privacy-preserving data publishing [59] and on privacy-
preserving data mining [109,156] that have been largely decoupled from the research
on PPRL. It is well known that data analysis may identify individuals despite the
masking of QID values [152]. Hence, there is a similar risk that the combined infor-
mation of matched records together with some background information could lead
to the identification of individuals (known as re-identification). Such risks must be
evaluated and addressed within a comprehensive privacy strategy including a closely
aligned PPRL and privacy-preserving data analysis/mining approach.
6.5 Evaluation, Frameworks, and Benchmarks
How to assess the quality (how many classified matches are true matches) and com-
pleteness (how many true matches have been classified as matches) of the records
linked in a PPRL project is very challenging because it is generally not possible
to inspect linked records due to privacy concerns. Manual assessment of individual
records would reveal sensitive information which is in contradiction to the objective
of PPRL. Not knowing how accurate and complete linked data are is however a major
issue that will render any PPRL protocol impractical in applications where linkage
completeness and quality are crucial, as is the case in many Big Data applications
such as in the health or security domains.
Recent initial work has proposed ideas and concepts for interactive PPRL [100]
where parts of sensitive values are revealed for manual assessment. How to actually
implement such approaches in real applications, while ensuring the revealed infor-
mation is limited to a certain level of detail (for example providing k-anonymous
privacy for a certain value of k > 1 [152]) is an open research question that must be
solved. Interactive manual evaluation might also not be feasible in Big Data appli-
cations where the size and dynamic nature of data, as well as real-time processing
requirements, prohibit any manual inspection.
With regard to evaluating the privacy protection that a given PPRL technique
provides, unlike for measuring linkage quality and completeness (where standard
measurements such as runtime, reduction ratio, pairs completeness, pairs quality,
precision, recall, or accuracy are available [28]), there are currently no standard
measurements for assessing privacy in PPRL. Different measurements have been
proposed and used [46,164,165], making the comparison of different PPRL tech-
niques difficult. How to assess linkage quality and completeness, as well as privacy,
are must-solve problems as otherwise it will not be possible to evaluate the efficiency,
effectiveness, and privacy protection of PPRL techniques in real-world applications,
leaving these techniques non-practical.
An important direction of future work for PPRL is the development of frameworks
that allow the experimental comparison of different PPRL techniques with regard
to their scalability, linkage quality, and privacy preservation. No such framework
currently exists. Ideally, such frameworks allow researchers to easily ‘plug-in’ their
own algorithms such that over time a collection of PPRL algorithms is compiled that
can be tested and evaluated by researchers, as well as by practitioners to allow them
to identify the best technique to use for their application scenario.
An issue related to frameworks is the availability of publicly available benchmark
datasets for PPRL. While this is not a challenge limited to PPRL but to record linkage
research in general [28,95], it is particularly prominent for PPRL as it deals with
sensitive and confidential data. While for record linkage techniques publicly available
data from bibliographic or consumer product databases might be used [95], such data
are less useful for PPRL research as they have different characteristics compared to
personal data. The nature of the datasets to be linked using PPRL techniques is
obviously in strong contradiction to them being made public. Ideally researchers
working in PPRL are able to collaborate with practitioners that do have access to
real sensitive and confidential databases to allow them to evaluate their techniques
on such data.
A possible alternative to using benchmark datasets is the use of synthetic data that
are generated based on the characteristics of real data using data generators [34,153].
Such generators must be able to generate data with similar distribution of values, vari-
ations, and errors as would be expected in real datasets from the same domain. Several
such data generators have been developed and are used by researchers working in
PPRL as well as record linkage in general.
6.6 Discussion
As we have discussed in this section, there are various challenges that need to be
addressed in order to make PPRL practical for applications in a variety of domains.
Some of these challenges are general and do not just affect PPRL for Big Data; others
are specific to certain types of applications, including those in the Big Data space.
The challenge of scalability of PPRL towards very large databases is highly rel-
evant to the volume of Big Data, while the challenge of linkage quality of PPRL
is highly relevant to the veracity and variety of Big Data. The dynamic nature of
data in many Big Data applications, and the requirement of being able to link data
in real-time, are challenging all aspects of PPRL, as well as record linkage in gen-
eral [129]. This challenge corresponds to the velocity of Big Data and it requires the
development of novel techniques that are adaptive to changing data characteristics,
and that are highly efficient with regard to fast linking of streams of query records.
While the volume,variety, and veracity aspects of Big Data have been studied for
PPRL to some extent, the velocity aspect has so far not been addressed in a PPRL
context.
Making PPRL more secure and more private is challenged by all four V’s of Big
Data. Larger data volume likely means that only encoding techniques that require
little computational efforts per record can be employed, while dynamic data (velocity)
means such techniques have to be adaptable to changing data characteristics. Variety
means PPRL techniques have to be made more secure and private for various types
of data, while veracity requires them to also take data uncertainties into account. The
challenge of integrating PPRL into an overall privacy-preserving approach has also
not seen any work so far. All four V’s of Big Data will affect the overall efficiency and
effectiveness of systems that enable the management and analysis of sensitive and
confidential information in a privacy-preserving manner. The more basic challenges
of improving scalability, linkage quality, privacy and evaluation need to be solved first
before this more complex challenge of an overall privacy-preserving system can be
addressed.
The final challenge of evaluation is affected by all aspects of Big Data. Improved
evaluation of PPRL systems requires that databases that are large, heterogeneous,
dynamic, and that contain uncertain data, can be handled and evaluated efficiently
and accurately. So far no research in PPRL has investigated evaluation specifically
for Big Data. While the lack of general benchmarks and frameworks is already a gap
in PPRL and record linkage research in general, Big Data will make this challenge
even more pronounced. Compared to frameworks that can handle small and medium
sized static datasets only, it is even more difficult to develop frameworks that enable
privacy-preserving linking of very large and dynamic databases, as is making such
datasets publicly available. No work addressing this challenge in the context of Big
Data has been published.
7 Conclusions
Privacy-preserving record linkage (PPRL) is an emerging research field that is being
required by many different applications to enable effective and efficient linkage of
databases across different organizations without compromising privacy and confiden-
tiality of the entities in these databases. In the Big Data era, tremendous opportunities
can be realized by linking data at the cost of additional challenges. In this chapter, we
have provided background material required to understand the applications, process,
and challenges of PPRL, and we have reviewed existing PPRL approaches to under-
stand the literature. Based on the analysis of existing techniques, we have discussed
several interesting and challenging directions for future work in PPRL for Big Data.
With the increasing trend of Big Data in organizations, more research is required
towards the development of techniques that allow for multiple large databases to be
linked in privacy-preserving, effective, and efficient ways, thereby facilitating novel
ways of data analysis and mining that currently are not feasible due to scalability,
quality, and privacy-preserving challenges.
Acknowledgements This work was partially funded by the Australian Research Council under
Discovery Project DP130101801, the German Academic Exchange Service (DAAD) and Universi-
ties Australia (UA) under the Joint Research Co-operation Scheme, and also funded by the German
Federal Ministry of Education and Research within the project Competence Center for Scalable
Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
References
1. R. Agrawal, A. Evfimievski, R. Srikant, Information sharing across private databases, in ACM
SIGMOD (2003), pp. 86–97
2. A. Arasu, V. Ganti, R. Kaushik, Efficient exact set-similarity joins, in PVLDB (2006), pp.
918–929
3. A. Arasu, M. Götz, R. Kaushik, On active learning of record matching packages, in ACM
SIGMOD (2010), pp. 783–794
4. Y. Aumann, Y. Lindell, Security against covert adversaries: efficient protocols for realistic
adversaries. J. Cryptol. 23(2), 281–343 (2010)
5. T. Bachteler, J. Reiher, R. Schnell, Similarity filtering with multibit trees for record linkage.
Technical Report WP-GRLC-2013-01, German Record Linkage Center (2013)
6. D. Barone, A. Maurino, F. Stella, C. Batini, A privacy-preserving framework for accuracy
and completeness quality assessment, in Emerging Paradigms in Informatics, Systems and
Communication (2009), pp. 83–87
7. J.E. Barros, J.C. French, W.N. Martin, P.M. Kelly, T.M. Cannon, Using the triangle inequality
to reduce the number of comparisons required for similarity-based retrieval, in Electronic
Imaging Science and Technology (1996), pp. 392–403
8. C. Batini, M. Scannapieca, Data quality: Concepts, Methodologies And Techniques. Data-
Centric Systems and Applications (Springer, Berlin, 2006)
9. R. Baxter, P. Christen, T. Churches, A comparison of fast blocking methods for record linkage,
in SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation (2003),
pp. 25–27
10. R.J. Bayardo, Y. Ma, R. Srikant, Scaling Up All Pairs Similarity Search, in WWW (2007), pp.
131–140
11. K. Bellare, S. Iyengar, A.G. Parameswaran, V. Rastogi, Active sampling for entity matching,
in ACM SIGKDD (2012), pp. 1131–1139
12. A. Berman, L.G. Shapiro, Selecting good keys for triangle-inequality-based pruning algo-
rithms, in IEEE Workshop on Content-Based Access of Image and Video Database (1998),
pp. 12–19
13. I. Bhattacharya, L. Getoor, Collective entity resolution in relational data. ACM TKDD 1(1),
1–35 (2007)
14. M. Bilenko, R.J. Mooney, Adaptive duplicate detection using learnable string similarity mea-
sures, in ACM SIGKDD (2003), pp. 39–48
15. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7),
422–426 (1970)
16. L. Bonomi, L. Xiong, R. Chen, B. Fung, Frequent grams based embedding for privacy pre-
serving record linkage, in ACM CIKM (2012), pp. 1597–1601
17. H. Bouzelat, C. Quantin, L. Dusserre, Extraction and anonymity protocol of medical file, in
AMIA Fall Symposium (1996), pp. 323–327
18. A.Z. Broder, On the resemblance and containment of documents, in Compression and Com-
plexity of Sequences. IEEE (1997), pp. 21–29
19. A. Broder, M. Mitzenmacher, A. Mitzenmacher, Network applications of Bloom filters: a
survey. Internet Math. 1(4), 485–509 (2004)
20. E. Brook, D. Rosman, C. Holman, Public good through data linkage: measuring research
outputs from the Western Australian data linkage system. Aust. NZ J. Public Health 32,
19–23 (2008)
21. R. Canetti, Security and composition of multiparty cryptographic protocols. J. Cryptol. 13(1),
143–202 (2000)
22. A. Cavoukian, J. Jonas, Privacy by design in the age of Big Data. Technical report,
Information and Privacy Commissioner, Ontario (2012)
23. P. Christen, A comparison of personal name matching: techniques and practical issues, in
IEEE ICDM Workshop on Mining Complex Data (2006), pp. 290–294
24. P. Christen, Privacy-preserving data linkage and geocoding: current approaches and research
directions, in IEEE ICDM Workshop on Privacy Aspects of Data Mining (2006), pp. 497–501
25. P. Christen, Automatic record linkage using seeded nearest neighbour and support vector
machine classification, in ACM SIGKDD (2008), pp. 151–159
26. P. Christen, Febrl: an open source data cleaning, deduplication and record linkage system
with a graphical user interface, in ACM SIGKDD (2008), pp. 1065–1068
27. P. Christen, Geocode matching and privacy preservation, in Workshop on Privacy, Security,
and Trust in KDD (Springer, Berlin, 2009), pp. 7–24
28. P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution,
and Duplicate Detection (Springer, Berlin, 2012)
29. P. Christen, A survey of indexing techniques for scalable record linkage and deduplication.
IEEE TKDE 24(9), 1537–1555 (2012)
30. P. Christen, T. Churches, M. Hegland, Febrl – a parallel open source data linkage system, in
Springer PAKDD (2004), pp. 638–647
31. P. Christen, K. Goiser, Quality and complexity measures for data linkage and deduplication,
in Quality Measures in Data Mining, vol. 43. Studies in Computational Intelligence (Springer,
Berlin, 2007), pp. 127–151
32. P. Christen, R. Gayler, D. Hawking, Similarity-aware indexing for real-time entity resolution,
in ACM CIKM (2009), pp. 1565–1568
33. P. Christen, R.W. Gayler, Adaptive temporal entity resolution on dynamic databases, in
PAKDD (2013), pp. 558–569
34. P. Christen, D. Vatsalan, Flexible and extensible generation and corruption of personal data,
in ACM CIKM (2013), pp. 1165–1168
35. T. Churches, P. Christen, Some methods for blindfolded record linkage. BioMed Cent. Med.
Inf. Decision Mak. 4(9) (2004)
36. T. Churches, P. Christen, K. Lim, J.X. Zhu, Preparation of name and address data for record
linkage using hidden Markov models. BioMed Cent. Med. Inf. Decision Mak. 2(9) (2002)
37. D.E. Clark, Practical introduction to record linkage for injury research. Inj. Prev. 10, 186–191
(2004)
38. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M. Zhu, Tools for privacy-preserving distributed
data mining. SIGKDD Explor. 4(2), 28–34 (2002)
39. W.W. Cohen, Data integration using similarity joins and a word-based information represen-
tation language. ACM TOIS 18(3), 288–321 (2000)
40. W.W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for
data integration, in ACM SIGKDD (2002), pp. 475–480
41. G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch
and its applications. J. Algorithms 55(1), 58–75 (2005)
42. G. Dal Bianco, R. Galante, C.A. Heuser, A fast approach for parallel deduplication on multi-
core processors, in ACM Symposium on Applied Computing (2011), pp. 1027–1032
43. D. Dey, V. Mookerjee, D. Liu, Efficient techniques for online record linkage. IEEE TKDE
23(3), 373–387 (2010)
44. W. Du, M. Atallah, Protocols for secure remote database access with approximate matching,
in ACM WSPEC (Springer, Berlin, 2000), pp. 87–111
45. G.T. Duncan, M. Elliot, J.-J. Salazar-González, Statistical Confidentiality: Principles and
Practice (Springer, New York, 2011)
46. E. Durham, A framework for accurate, efficient private record linkage. Ph.D. thesis, Faculty
of the Graduate School of Vanderbilt University, Nashville, TN, 2012
47. E. Durham, Y. Xue, M. Kantarcioglu, B. Malin, Private medical record linkage with approx-
imate matching, in AMIA Annual Symposium (2010), pp. 182–186
48. E.A. Durham, C. Toth, M. Kuzu, M. Kantarcioglu, Y. Xue, B. Malin, Composite Bloom filters
for secure record linkage. IEEE TKDE 26(12), 2956–2968 (2013)
49. L. Dusserre, C. Quantin, H. Bouzelat, A one way public key cryptosystem for the linkage of
nominal files in epidemiological studies. Medinfo 8, 644–647 (1995)
50. C. Dwork, Differential privacy, in ICALP (2006), pp. 1–12
51. M.G. Elfeky, V.S. Verykios, A.K. Elmagarmid, TAILOR: a record linkage toolbox, in IEEE
ICDE (2002), pp. 17–28
52. A. Elmagarmid, P. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey. IEEE TKDE
19(1), 1–16 (2007)
53. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery
and Data Mining (The MIT Press, Cambridge, 1996)
54. I.P. Fellegi, A.B. Sunter, A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210
(1969)
55. S.E. Fienberg, Confidentiality and disclosure limitation. Encycl. Soc. Meas. 1, 463–469 (2005)
56. B. Forchhammer, T. Papenbrock, T. Stening, S. Viehmeier, U. Draisbach, F. Naumann, Dupli-
cate detection on GPUs, in BTW (2013), pp. 165–184
57. M. Freedman, Y. Ishai, B. Pinkas, O. Reingold, Keyword search and oblivious pseudorandom
functions, in Theory of Cryptography (2005), pp. 303–324
58. Z. Fu, J. Zhou, P. Christen, M. Boot, Multiple instance learning for group record linkage, in
PAKDD, Springer LNAI (2012), pp. 171–182
59. B. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey of recent
developments. ACM Comput. Surv. 42(4), 14 (2010)
60. S.R. Ganta, S.P. Kasiviswanathan, A. Smith, Composition attacks and auxiliary information
in data privacy, in ACM SIGKDD (2008), pp. 265–273
61. A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in VLDB
(1999), pp. 518–529
62. O. Goldreich, Foundations of Cryptography: Basic Applications, vol. 2. (Cambridge Univer-
sity Press, Cambridge, 2004)
63. L. Gu, R. Baxter, Decision models for record linkage, in Selected Papers from AusDM, LNCS,
vol. 3755 (Springer, Berlin, 2006), pp. 146–160
64. M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava, Fast indexes and algorithms for
set similarity selection queries, in IEEE ICDE (2008), pp. 267–276
65. R. Hall, S. Fienberg, Privacy-preserving record linkage, in PSD (2010), pp. 269–283
66. M. Herschel, F. Naumann, S. Szott, M. Taubert, Scalable iterative graph duplicate detection.
IEEE TKDE 24(11), 2094–2108 (2012)
67. A. Inan, M. Kantarcioglu, E. Bertino, M. Scannapieco, A hybrid approach to private record
linkage, in IEEE ICDE (2008), pp. 496–505
68. A. Inan, M. Kantarcioglu, G. Ghinita, E. Bertino, Private record matching using differential
privacy, in EDBT (2010), pp. 123–134
69. P. Indyk, R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimen-
sionality, in ACM Symposium on the Theory of Computing (1998), pp. 604–613
70. E. Ioannou, W. Nejdl, C. Niederée, Y. Velegrakis, On-the-fly entity-aware query processing
in the presence of linkage. PVLDB 3(1–2), 429–438 (2010)
71. W. Jiang, C. Clifton, AC-framework for privacy-preserving collaboration, in SIAM SDM
(2007), pp. 47–56
72. W. Jiang, C. Clifton, M. Kantarcıoğlu, Transforming semi-honest protocols to ensure account-
ability. Elsevier DKE 65(1), 57–74 (2008)
73. J. Jonas, J. Harper, Effective counterterrorism and the limited role of predictive data mining.
Policy Anal. 584, 1–12 (2006)
74. D. Kalashnikov, S. Mehrotra, Domain-independent data cleaning via analysis of entity-
relationship graph. ACM TODS 31(2), 716–767 (2006)
75. M. Kantarcioglu, W. Jiang, B. Malin, A privacy-preserving framework for integrating person-
specific databases, in PSD (2008), pp. 298–314
76. A. Karakasidis, V.S. Verykios, Secure blocking + secure matching = secure record linkage.
JCSE 5, 223–235 (2011)
77. A. Karakasidis, V.S. Verykios, Reference table based k-anonymous private blocking, in ACM
SAC (2012), pp. 859–864
78. A. Karakasidis, V.S. Verykios, A sorted neighborhood approach to multidimensional privacy
preserving blocking, in IEEE ICDMW (2012), pp. 937–944
79. A. Karakasidis, V.S. Verykios, P. Christen, Fake injection strategies for private phonetic match-
ing. DPM Springer 7122, 9–24 (2012)
80. D. Karapiperis, D. Vatsalan, V.S. Verykios, P. Christen, Large-scale multi-party counting set
intersection using a space efficient global synopsis, in DASFAA (2015), pp. 329–345
81. D. Karapiperis, D. Vatsalan, V.S. Verykios, P. Christen, Efficient record linkage using a com-
pact hamming space, in EDBT (2016), pp. 209–220
82. D. Karapiperis, V.S. Verykios, A distributed framework for scaling up LSH-based computa-
tions in privacy preserving record linkage, in ACM BCI (2013), pp. 102–109
83. D. Karapiperis, V.S. Verykios, A distributed near-optimal LSH-based framework for privacy-
preserving record linkage. ComSIS 11(2), 745–763 (2014)
84. D. Karapiperis, V.S. Verykios, An LSH-based blocking approach with a homomorphic match-
ing technique for privacy-preserving record linkage. IEEE TKDE 27(4), 909–921 (2015)
85. D. Karapiperis, V.S. Verykios, A fast and efficient hamming LSH-based scheme for accurate
linkage, in Springer KAIS (2016), pp. 1–24
86. H. Kargupta, S. Datta, Q. Wang, K. Sivakumar, On the privacy-preserving properties of random
data perturbation techniques, in IEEE ICDM (2003), p. 99
87. H. Kargupta, S. Datta, Q. Wang, K. Sivakumar, Random-data perturbation techniques and
privacy-preserving data mining. Springer KAIS 7(4), 387–414 (2005)
88. C.W. Kelman, J. Bass, D. Holman, Research use of linked health data - a best practice protocol.
Aust. NZ J. Public Health 26, 251–255 (2002)
89. H. Kim, D. Lee, HARRA: fast iterative hashed record linkage for large-scale data collections, in
EDBT (2010), pp. 525–536
90. H.-S. Kim, D. Lee, Parallel linkage, in ACM CIKM (2007), pp. 283–292
91. T. Kirsten, L. Kolb, M. Hartung, A. Groß, H. Köpcke, E. Rahm, Data partitioning for parallel
entity matching, in QDB (2010)
92. L. Kissner, D. Song, Private and threshold set-intersection. Technical Report, Carnegie
Mellon University (2004)
93. L. Kolb, A. Thor, E. Rahm, Dedoop: efficient deduplication with Hadoop. PVLDB 5(12),
1878–1881 (2012)
94. L. Kolb, A. Thor, E. Rahm, Load balancing for MapReduce-based entity resolution, in IEEE
ICDE (2012), pp. 618–629
95. H. Köpcke, E. Rahm, Frameworks for entity matching: a comparison. Elsevier DKE 69(2),
197–210 (2010)
96. H. Köpcke, A. Thor, E. Rahm, Evaluation of entity resolution approaches on real-world match
problems. PVLDB 3(1), 484–493 (2010)
97. H. Krawczyk, M. Bellare, R. Canetti, HMAC: keyed-hashing for message authentication, in
Internet RFCs (1997)
98. T.G. Kristensen, J. Nielsen, C.N. Pedersen, A tree-based method for the rapid screening of
chemical fingerprints. Algorithms Mol. Biol. 5(1), 9 (2010)
99. H. Kum, A. Krishnamurthy, A. Machanavajjhala, S. Ahalt, Population informatics: tapping
the social genome to advance society: a vision for putting “big data” to work for population
informatics. Computer (2013)
100. H.-C. Kum, A. Krishnamurthy, A. Machanavajjhala, M.K. Reiter, S. Ahalt, Privacy preserving
interactive record linkage. JAMIA 21(2), 212–220 (2014)
101. M. Kuzu, M. Kantarcioglu, E. Durham, B. Malin, A constraint satisfaction cryptanalysis of
Bloom filters in private record linkage. PETS Springer LNCS 6794, 226–245 (2011)
102. M. Kuzu, M. Kantarcioglu, E.A. Durham, C. Toth, B. Malin, A practical approach to achieve
private medical record linkage in light of public resources. JAMIA 20(2), 285–292 (2013)
103. M. Kuzu, M. Kantarcioglu, A. Inan, E. Bertino, E. Durham, B. Malin, Efficient privacy-aware
record integration, in ACM EDBT (2013), pp. 167–178
104. P. Lai, S. Yiu, K. Chow, C. Chong, L. Hui, An efficient Bloom filter based solution for
multiparty private matching, in SAM (2006)
105. F. Li, Y. Chen, B. Luo, D. Lee, P. Liu, Privacy preserving group linkage, in Scientific and
Statistical Database Management (Springer, Berlin, 2011), pp. 432–450
106. N. Li, T. Li, S. Venkatasubramanian, T-closeness: privacy beyond k-anonymity and l-diversity,
in IEEE ICDE (2007), pp. 106–115
107. P. Li, X. Dong, A. Maurino, D. Srivastava, Linking temporal records. PVLDB 4(11), 956–967
(2011)
108. Z. Lin, M. Hewett, R.B. Altman, Using binning to maintain confidentiality of medical data,
in AMIA Symposium (2002), p. 454
109. Y. Lindell, B. Pinkas, Privacy preserving data mining, in CRYPTO (Springer, Berlin, 2000),
pp. 36–54
110. Y. Lindell, B. Pinkas, An efficient protocol for secure two-party computation in the presence
of malicious adversaries, in EUROCRYPT (2007), pp. 52–78
111. Y. Lindell, B. Pinkas, Secure multiparty computation for privacy-preserving data mining. JPC
1(1), 59–98 (2009)
112. H. Liu, H. Wang, Y. Chen, Ensuring data storage security against frequency-based attacks in
wireless networks, in DCOSS, Springer LNCS, vol. 6131 (2010), pp. 201–215
113. H. Lu, M.-C. Shan, K.-L. Tan, Optimization of multi-way join queries for parallel execution,
in VLDB (1991), pp. 549–560
114. M. Luby, C. Rackoff, How to construct pseudo-random permutations from pseudo-random
functions, in CRYPTO, vol. 85 (1986), p. 447
115. A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: privacy beyond
k-anonymity. ACM TKDD 1(1), 3 (2007)
116. B.A. Malin, K. El Emam, C.M. O’Keefe, Biomedical data privacy: problems, perspectives,
and recent advances. JAMIA 20(1), 2–6 (2013)
117. M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Prob-
abilistic Analysis (Cambridge University Press, Cambridge, 2005)
118. N. Mohammed, B. Fung, M. Debbabi, Anonymity meets game theory: secure data integration
with malicious participants. VLDB J. 20(4), 567–588 (2011)
119. M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, E. Rahm, A survey of current link discovery
frameworks. Semantic Web Journal (2016)
120. A.N. Ngomo, L. Kolb, N. Heino, M. Hartung, S. Auer, E. Rahm, When to reach for the cloud:
using parallel hardware for link discovery, in ESWC (2013), pp. 275–289
121. Office for National Statistics, Beyond 2011: matching anonymous data (2013)
122. C. O’Keefe, M. Yung, L. Gu, R. Baxter, Privacy-preserving data linkage protocols, in ACM
WPES (2004), pp. 94–102
123. B. On, N. Koudas, D. Lee, D. Srivastava, Group linkage, in IEEE ICDE (2007), pp. 496–505
124. C. Pang, L. Gu, D. Hansen, A. Maeder, Privacy-preserving fuzzy matching using a pub-
lic reference table, in Intelligent Patient Management, vol. 189. Studies in Computational
Intelligence (Springer, Berlin, 2009), pp. 71–89
125. C. Phua, K. Smith-Miles, V. Lee, R. Gayler, Resilient identity crime detection. IEEE TKDE
24(3), 533–546 (2012)
126. C. Quantin, H. Bouzelat, L. Dusserre, Irreversible encryption method by generation of poly-
nomials. Med. Inf. Internet Med. 21(2), 113–121 (1996)
127. C. Quantin, H. Bouzelat, F. Allaert, A. Benhamiche, J. Faivre, L. Dusserre, How to ensure
data security of an epidemiological follow-up: quality assessment of an anonymous record
linkage procedure. IJMI 49(1), 117–122 (1998)
128. E. Rahm, H.H. Do, Data cleaning: problems and current approaches. IEEE Data Eng. Bull.
23(4), 3–13 (2000)
129. B. Ramadan, P. Christen, H. Liang, R.W. Gayler, Dynamic sorted neighborhood indexing for
real-time entity resolution. ACM JDIQ 6(4), 15 (2015)
130. T. Ranbaduge, P. Christen, D. Vatsalan, Tree based scalable indexing for multi-party privacy-
preserving record linkage, in AusDM (2014)
131. T. Ranbaduge, D. Vatsalan, P. Christen, Clustering-based scalable indexing for multi-party
privacy-preserving record linkage, in Springer PAKDD (2015), pp. 549–561
132. T. Ranbaduge, D. Vatsalan, P. Christen, Merlin–a tool for multi-party privacy-preserving
record linkage, in IEEE ICDMW (2015), pp. 1640–1643
133. T. Ranbaduge, D. Vatsalan, P. Christen, Hashing-based distributed multi-party blocking for
privacy-preserving record linkage, in Springer PAKDD (2016), pp. 415–427
134. T. Ranbaduge, D. Vatsalan, S. Randall, P. Christen, Evaluation of advanced techniques for
multi-party privacy-preserving record linkage on real-world health databases, in IPDLN
(2016)
135. S.M. Randall, A.M. Ferrante, J.H. Boyd, J.B. Semmens, Privacy-preserving record linkage
on large real world datasets. Elsevier JBI 50, 205–212 (2014)
136. S.M. Randall, A.M. Ferrante, J.H. Boyd, A.P. Brown, J.B. Semmens, Limited privacy pro-
tection and poor sensitivity: is it time to move on from the statistical linkage key-581? Health
Inf. Manag. J. 37, 60–62 (2016)
137. V. Rastogi, N. Dalvi, M. Garofalakis, Large-scale collective entity matching. PVLDB 4,
208–218 (2011)
138. C. Rong, W. Lu, X. Wang, X. Du, Y. Chen, A.K.H. Tung, Efficient and scalable processing
of string similarity join. IEEE TKDE 25(10), 2217–2230 (2013)
139. M. Roughan, Y. Zhang, Secure distributed data-mining and its application to large-scale
network measurements. ACM SIGCOMM Comput. Commun. Rev. 36(1), 7–14 (2006)
140. T. Ryan, D. Gibson, B. Holmes, A national minimum data set for home and community care.
Australian Institute of Health and Welfare (1999)
141. M. Scannapieco, I. Figotin, E. Bertino, A. Elmagarmid, Privacy preserving schema and data
matching, in ACM SIGMOD (2007), pp. 653–664
142. D.A. Schneider, D.J. DeWitt, Tradeoffs in processing complex join queries via hashing in
multiprocessor database machines, in VLDB (1990), pp. 469–480
143. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd edn.
(Wiley, New York, 1996)
144. R. Schnell, Privacy-preserving record linkage and privacy-preserving blocking for large files
with cryptographic keys using multibit trees, in JSM (2013), pp. 187–194
145. R. Schnell, An efficient privacy-preserving record linkage technique for administrative data
and censuses. Stat. J. IAOS 30(3), 263–270 (2014)
146. R. Schnell, T. Bachteler, S. Bender, A toolbox for record linkage. Aust. J. Stat. 33(1–2),
125–133 (2004)
147. R. Schnell, T. Bachteler, J. Reiher, Privacy-preserving record linkage using Bloom filters.
BMC Med. Inf. Decision Mak. 9(1), 41 (2009)
148. R. Schnell, T. Bachteler, J. Reiher, A novel error-tolerant anonymous linking code, in German
Record Linkage Center, WP-GRLC-2011-02 (2011)
149. Z. Sehili, E. Rahm, Speeding up privacy preserving record linkage for metric space similarity
measures, in Datenbank-Spektrum (2016), pp. 1–10
150. Z. Sehili, L. Kolb, C. Borgs, R. Schnell, E. Rahm, Privacy preserving record linkage with
PPJoin, in BTW Conference (2015)
151. D. Song, D. Wagner, A. Perrig, Practical techniques for searches on encrypted data, in IEEE
Symposium on Security and Privacy (2000), pp. 44–55
152. L. Sweeney, K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness
Knowl.-Based Syst. 10(5), 557–570 (2002)
153. K.-N. Tran, D. Vatsalan, P. Christen, GeCo: an online personal data generator and corruptor,
in ACM CIKM (2013), pp. 2473–2476
154. S. Trepetin, Privacy-preserving string comparisons in record linkage systems: a review. Inf.
Secur. J.: A Global Perspect. 17(5), 253–266 (2008)
155. E. Turgay, T. Pedersen, Y. Saygın, E. Savaş, A. Levi, Disclosure risks of distance preserving
data transformations, in Springer SSDBM (2008), pp. 79–94
156. J. Vaidya, Y. Zhu, C.W. Clifton, Privacy Preserving Data Mining, vol. 19. Advances in Infor-
mation Security (Springer, Berlin, 2006)
157. E. Van Eycken, K. Haustermans, F. Buntinx et al., Evaluation of the encryption procedure and
record linkage in the Belgian national cancer registry. Archiv. Public Health 58(6), 281–294
(2000)
158. D. Vatsalan, P. Christen, An iterative two-party protocol for scalable privacy-preserving record
linkage, in AusDM, CRPIT (2012), pp. 127–138
159. D. Vatsalan, P. Christen, Sorted nearest neighborhood clustering for efficient private blocking,
in Springer PAKDD, vol. 7819 (2013), pp. 341–352
160. D. Vatsalan, P. Christen, Scalable privacy-preserving record linkage for multiple databases,
in ACM CIKM (2014), pp. 1795–1798
161. D. Vatsalan, P. Christen, Privacy-preserving matching of similar patients. Elsevier JBI 59,
285–298 (2016)
162. D. Vatsalan, P. Christen, V.S. Verykios, An efficient two-party protocol for approximate match-
ing in private record linkage, in AusDM (2011), pp. 125–136
163. D. Vatsalan, P. Christen, V.S. Verykios, Efficient two-party private blocking based on sorted
nearest neighborhood clustering, in ACM CIKM (2013), pp. 1949–1958
164. D. Vatsalan, P. Christen, V.S. Verykios, A taxonomy of privacy-preserving record linkage
techniques. Elsevier JIS 38(6), 946–969 (2013)
165. D. Vatsalan, P. Christen, C.M. O’Keefe, V.S. Verykios, An evaluation framework for privacy-
preserving record linkage. JPC 6(1), 35–75 (2014)
166. R. Vernica, M.J. Carey, C. Li, Efficient parallel set-similarity joins using MapReduce, in ACM
SIGMOD (2010), pp. 495–506
167. V.S. Verykios, A. Karakasidis, V. Mitrogiannis, Privacy preserving record linkage approaches.
IJDMMM 1(2), 206–221 (2009)
168. G. Wang, H. Chen, H. Atabakhsh, Automatically detecting deceptive criminal identities.
Commun. ACM 47(3), 70–76 (2004)
169. Q. Wang, D. Vatsalan, P. Christen, Efficient interactive training selection for large-scale entity
resolution, in PAKDD (2015), pp. 562–573
170. Z. Wen, C. Dong, Efficient protocols for private record linkage, in ACM Symposium on Applied
Computing (2014), pp. 1688–1694
171. W.E. Winkler, Methods for evaluating and creating data quality. Elsevier JIS 29(7), 531–550
(2004)
172. C. Xiao, W. Wang, X. Lin, J.X. Yu, Efficient similarity joins for near duplicate detection, in
WWW (2008), pp. 131–140
173. M. Yakout, M. Atallah, A. Elmagarmid, Efficient private record linkage, in IEEE ICDE (2009),
pp. 1283–1286
174. P. Zezula, G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach,
vol. 32 (Springer, Berlin, 2006)
175. X. Zhang, C. Liu, S. Nepal, J. Chen, An efficient quasi-identifier index based approach for
privacy preservation over incremental data sets on cloud. J. Comput. Syst. Sci. 79(5), 542–555
(2013)
176. X. Zhang, C. Liu, S. Nepal, S. Pandey, J. Chen, A privacy leakage upper bound constraint-
based approach for cost-effective privacy preserving of intermediate data sets in cloud. IEEE
TPDS 24(6), 1192–1202 (2013)