Enhanced Privacy Preserving Model for Data Using (α, β, k)-Anonymity Model and Lossy join

International Journal of Computer Science and Information Security (IJCSIS),
Vol. 13, No. 11, November 2015
http://sites.google.com/site/ijcsis/
ISSN 1947-5500
Abstract: This paper enhances the privacy preserving model published in our previous paper, "An Effective Privacy Preserving Model for Databases Using (α, β, k) - Anonymity Model and Lossy Join" [1]. That model maintains the privacy of multiple sensitive attributes by publishing the data in two tables: one for QI-tuples and the other for sensitive attributes. The two tables are linked by connecting numbers that depend on one of the sensitive attributes, as in the lossy join technique. The authors found that, in some cases, a researcher cannot retrieve the exact frequency of any of the remaining sensitive attributes unless all sensitive attributes in the same tuple of the sensitive attributes table are treated together as a set. In other words, the frequency derived for one of the remaining sensitive attributes differs from its frequency in the original table. This problem limits the ability of researchers to use the data and consequently reduces research accuracy. This paper proposes a solution: adding frequency details to the published sensitive attributes table for the sensitive attributes that are not used to build the connecting numbers. The solution increases data utility and improves research accuracy.
Index Terms: Privacy Preserving Model, Anatomy Technique, Lossy Join, Multiple Sensitive Attributes, Connecting Numbers.
I. INTRODUCTION
Data mining is an increasingly important technology for extracting useful knowledge hidden in huge collections of data [2-6]. It can also be defined as the process of analyzing large quantities of data in order to discover meaningful patterns and rules. There are, however, negative social perceptions about data mining, among which are potential privacy violation and potential discrimination [7, 8]. The former is an unintentional or deliberate disclosure of a user profile or activity data as part of the output of a data mining algorithm or as a result of data sharing. Any data mining model generally assumes that the underlying data is freely accessible, yet even data from which identifiers have been removed is not secure and remains open to linking attacks [9]. For this reason, privacy preserving data mining (PPDM) has been introduced to protect individual privacy. PPDM has become increasingly important because it allows privacy-sensitive attributes to be shared for analytical purposes.
Many privacy techniques have been developed. One of the most common is k-anonymity, an emerging concept for the protection of released data [10-15]. Anonymity typically refers to the state of an individual's personal identity, or personally identifiable information, being publicly unknown. When released information is linked with a confidential table, data disclosure may result. The anonymity model was introduced to control such linking attacks. The k-anonymity model transforms the quasi-identifiers (the attributes that make linking attacks possible) in such a manner that an adversary cannot infer the sensitive attributes related to them. On the other hand, it is difficult for a data publisher to generate an anonymous table when multiple sensitive attributes are present in the data set, because concentrating on protecting one sensitive attribute may cause disclosure of identity through another [14]. An attempt to solve that problem was introduced in [1], which proposed a model that maintains the privacy of multiple sensitive attributes. That model publishes the data in two tables: one for QI-tuples and the other for sensitive attributes. It uses connecting numbers that depend on one of the sensitive attributes. In the model of [1], a problem may arise if a researcher wants to know the frequency of any one of the remaining sensitive attributes. The authors found that this frequency differs from that in the original table, especially if the researcher does not treat all sensitive attributes together as a set. Therefore, the authors propose an enhanced model that avoids this problem by adding frequency details to the published sensitive attributes table. These frequency details let the researcher know the exact frequency of each of the remaining sensitive attributes, as explained later in this paper.
In the next section, the authors discuss multiple sensitive attributes. Section III presents a previous attempt at privacy preserving for databases, the (α, β, k)-anonymity model, and the application of lossy join with k-anonymity techniques. Section IV presents privacy preserving using the anatomy technique. Section V introduces the implementation of the enhanced proposed model.
Enhanced Privacy Preserving Model for Data Using (α, β, k)-Anonymity Model and Lossy join
Abou_el_ela Abdo Hussien (1), Nagy Ramadan Darwish (2)
(1) Department of Computer Science, Shaqra University, KSA, abo_el_ela_2004@yahoo.com
(2) Department of Computer and Information Sciences, Institute of Statistical Studies and Research, Cairo University, drnagyd@yahoo.com
II. MULTIPLE SENSITIVE ATTRIBUTES
A sensitive attribute is an attribute whose value for a particular individual must be kept secret from people who have no direct access to the original data [1, 12]. A data publisher needs to prevent privacy disclosure, in which someone attacks the published table "T" and learns an individual's confidential information, for example that he suffers from a dangerous disease [13]. Information disclosure can be of three types [1, 14]:
• Attribute disclosure: sensitive attribute information of an individual is disclosed.
• Identity disclosure: an individual is linked to a particular record in the published data.
• Membership disclosure: information about whether an individual's record is in the published data or not is disclosed.
The k-anonymity model was introduced to protect sensitive attributes from interlopers. If an adversary who knows an individual's quasi-identifiers searches for that individual's identity, he finds at least k-1 other records with the same quasi-identifier values [14]. On the other hand, when multiple sensitive attributes are present in the records, data publishers face a big problem in maintaining privacy for all these attributes together. Table I shows 4-anonymous inpatient microdata and Table II describes the attributes of the dataset [14, 15]. Table II includes the sensitive attributes "Medical Status", "Occupation", and "Annual Income". When a data publisher concentrates on protecting one sensitive attribute, identity may be disclosed through another [14]. Therefore, we need a model that controls all sensitive attributes together.
III. A PREVIOUS ATTEMPT OF PRIVACY PRESERVING FOR
DATABASES
In this section, the authors present the previous paper, "An Effective Privacy Preserving Model for Databases Using (α, β, k) - Anonymity Model and Lossy Join" [1]. That paper introduced a model that addresses the problem of maintaining the privacy of multiple sensitive attributes, described in Section II, by publishing the data in two tables: one for QI-tuples and the other for sensitive attributes. The following sub-sections present the problem definition of the previous proposed model in [1], the (α, β, k)-anonymity model, and the previous proposed algorithm that combines the k-anonymity model with lossy join to protect the privacy of multiple sensitive attributes [1].
A. The Previous Proposed Model Problem Definition
An attacker with background knowledge could learn the identities and the exact QI-attribute values of all individuals [16]. Such background knowledge can be obtained from external tables and can concern individuals contained in the same equivalence class. Our previous proposed model was intended to solve this problem for multiple sensitive attributes, which can be explained using the following example:
TABLE I
4-Anonymous Inpatient Microdata
           NONSENSITIVE
Ser. No    Zip Code    Age    Nationality
1          130***      >30    *
2          130***      >30    *
3          130***      >30    *
4          130***      >30    *
5          1485**      ≥40    *
6          1485**      ≥40    *
7          1485**      ≥40    *
8          1485**      ≥40    *
9          130***      3*     *
10         130***      3*     *
11         130***      3*     *
12         130***      3*     *
TABLE II
Classification of Attributes
Ser. No    ATTRIBUTE         TYPE
1          ZIPCODE           NON-SENSITIVE
2          AGE               NON-SENSITIVE
3          NATIONALITY       NON-SENSITIVE
4          MEDICAL_STATUS    SENSITIVE
5          OCCUPATION        SENSITIVE
6          ANNUAL_INCOME     SENSITIVE
• Assume the data in Table III need to be published by a publisher such as a hospital or an insurance company. Both disease and salary are sensitive attributes.
• Table IV is an anonymized version of Table III. Although the disease and salary attributes both conform to 3-diversity in Table IV, the table cannot prevent a background knowledge attack, as the following cases show:
o If an attacker knows that someone named "Ali" is in the second QI-group and knows from background knowledge that the salary of "Ali" is not "2000", then the attacker can infer that "Ali" suffers from "Catatonia".
o If an attacker knows that someone named "Iman" is in the first QI-group and knows that the salary of "Iman" is not "6000", then the attacker can infer that "Iman" suffers from "Depression".
o Although the sensitive attributes conform to L-diversity [17], private information is still leaked. The main reason is that there is too little diversity between the multiple sensitive attributes.
The previous proposed model in [1] solved this problem, maintaining the privacy of the data to a large extent, as explained in the following subsections.
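The two attack cases can be replayed with a few lines of Python. This is only an illustrative sketch: the rows are the salary and disease columns of Table IV, the numeric QI-group labels 1 and 2 are shorthand introduced here, and the helper name `candidate_diseases` is hypothetical.

```python
# Table IV rows as (id, qi_group, salary, disease). Group 1 is the
# F / [25-30] / 66*** group and group 2 is the M / [35-40] / 6**** group.
TABLE_IV = [
    ("T1", 1, 6000, "Headache"),
    ("T2", 1, 4000, "Depression"),
    ("T3", 1, 2000, "Depression"),
    ("T4", 1, 6000, "Paranoia"),
    ("T5", 2, 5000, "Catatonia"),
    ("T6", 2, 2000, "Paranoia"),
    ("T7", 2, 6000, "Catatonia"),
    ("T8", 2, 2000, "Insomnia"),
]


def candidate_diseases(rows, qi_group, excluded_salary):
    """Diseases still possible once the attacker rules out one salary value."""
    return {
        disease
        for _, group, salary, disease in rows
        if group == qi_group and salary != excluded_salary
    }


# "Ali" is in the second QI-group and does not earn 2000.
ali = candidate_diseases(TABLE_IV, qi_group=2, excluded_salary=2000)
# "Iman" is in the first QI-group and does not earn 6000.
iman = candidate_diseases(TABLE_IV, qi_group=1, excluded_salary=6000)

# A single remaining candidate means the disease is disclosed.
print(ali, iman)
```

In both cases only one disease remains in the candidate set, which is exactly the leakage described in the two cases.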
TABLE III
Microdata
      QI                         SA
ID    Sex    Age    Zip code     Salary(S1)    Disease(S2)
T1    F      30     66212        6000          Headache
T2    F      28     66251        4000          Depression
T3    F      26     66253        2000          Depression
T4    F      26     66252        6000          Paranoia
T5    M      39     63223        5000          Catatonia
T6    M      40     65262        2000          Paranoia
T7    M      36     63232        6000          Catatonia
T8    M      35     65261        2000          Insomnia
TABLE IV
Anonymized Table
      QI                            SA
ID    Sex    Age        Zip code    Salary(S1)    Disease(S2)
T1    F      [25-30]    66***       6000          Headache
T2    F      [25-30]    66***       4000          Depression
T3    F      [25-30]    66***       2000          Depression
T4    F      [25-30]    66***       6000          Paranoia
T5    M      [35-40]    6****       5000          Catatonia
T6    M      [35-40]    6****       2000          Paranoia
T7    M      [35-40]    6****       6000          Catatonia
T8    M      [35-40]    6****       2000          Insomnia
B. (α, β, k)-Anonymity Model
Let "T" be a table that contains a set of attributes $(A_1, \ldots, A_n)$. These attributes can be divided into two separate categories. The first category contains the non-sensitive attributes $(Q_1, \ldots, Q_m)$ and the second category contains the sensitive attributes $(S_1, \ldots, S_i)$. The number of tuples in a QI-group is $QI_n$ [18]. The number of distinct values of sensitive attribute $S_i$ is $n_{S_i}$, and $n'_{S_i}$ is the number of distinct values of $S_i$ among the tuples that share the same value of $S_{i-1}$. "T" is said to satisfy (α, β, k)-anonymity if and only if:
1) T satisfies k-anonymity,
2) each sensitive attribute takes at least β distinct values $(2 \le \beta \le k)$ within the same QI-group, and
3) $\alpha = n_{S_i} - n'_{S_i} \neq 1$ in each QI-group of tuples.
To illustrate this anonymity approach, we analyze the data in Table IV, which satisfies 4-anonymity with respect to Sex, Age and Zip code and includes two QI-groups.
• The first group has three different diseases and three different salaries.
• The second group also has three different diseases and three different salaries; therefore β = 3.
• In the first group, $n_{S_1} = n_{S_2} = 3$ and $n'_{S_2} = 2$, because the tuples that share the salary value 6000 have two distinct disease values, "Headache" and "Paranoia".
• Thus $\alpha = n_{S_2} - n'_{S_2} = 3 - 2 = 1$, so the table does not satisfy (α, β, k)-anonymity.
From this analysis we know that Table IV leaks private information: if α = 1, an attacker with background knowledge can cause a leak. The previous proposed model in [1] was adopted to solve this problem.
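The check above can be sketched in Python. The definition leaves some room for interpretation, so two assumptions are made here and should be read as such: n' for the disease attribute is taken as the largest number of distinct diseases that share one salary value within a QI-group (this reproduces the worked example), and β is taken as the smallest distinct-value count over the two sensitive attributes. The function name `alpha_beta_per_group` is hypothetical.

```python
from collections import defaultdict

# (qi_group, salary S1, disease S2) taken from Table IV.
TABLE_IV = [
    (1, 6000, "Headache"), (1, 4000, "Depression"),
    (1, 2000, "Depression"), (1, 6000, "Paranoia"),
    (2, 5000, "Catatonia"), (2, 2000, "Paranoia"),
    (2, 6000, "Catatonia"), (2, 2000, "Insomnia"),
]


def alpha_beta_per_group(rows):
    """Return {qi_group: (alpha, beta)} under the assumptions stated above."""
    groups = defaultdict(list)
    for group, s1, s2 in rows:
        groups[group].append((s1, s2))

    result = {}
    for group, pairs in groups.items():
        n_s1 = len({s1 for s1, _ in pairs})  # distinct salaries
        n_s2 = len({s2 for _, s2 in pairs})  # distinct diseases
        # Distinct diseases observed together with each salary value.
        by_s1 = defaultdict(set)
        for s1, s2 in pairs:
            by_s1[s1].add(s2)
        n_s2_prime = max(len(values) for values in by_s1.values())
        result[group] = (n_s2 - n_s2_prime, min(n_s1, n_s2))
    return result


table_iv_check = alpha_beta_per_group(TABLE_IV)
print(table_iv_check)
```

Both QI-groups of Table IV give α = 1 and β = 3 under this reading, so condition 3 fails for the table as a whole.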
C. Applying Lossy Join with K-anonymity Technique
This section introduces the lossy join technique, explains how it helps conceal sensitive attributes, and shows how to apply it with (α, β, k)-anonymity.
1) The Lossy Join Technique
Recent work has shown that lossy join is useful in privacy preserving data publishing [19]. The idea is that if two tables with a join attribute are published, the join of the two tables can be lossy, and this lossy join helps to conceal the private information. The idea of lossy join is used to derive a new mechanism for achieving a similar privacy preservation target.
• Consider the example in Table V, a (0.5, 2)-anonymization. From this table, we can generate a Temp table as shown in Table VI.
• For each equivalence class "E" in the anonymized table, the author assigns a unique identifier (ID) to "E" and to all tuples in "E".
• Then the author attaches the corresponding ID to each tuple in the original raw table, forming a new table "Temp" (Table VI).
• From the Temp table, we can generate two separate tables, Tables VII (a) and VII (b).
• The two tables share the ClassID attribute.
• If we join these two tables on ClassID, it is easy to see that the join is lossy and that the Temp table cannot be derived from the join.
• The result of joining the two tables is given in Table VIII.
TABLE V
A (0.5, 2)-anonymization Table
Job                    Birth    Postcode    Disease
Clerk                  1975     4350        HIV
manager                1955     4350        flu
clerk                  1955     5432        flu
factory worker         1955     5432        fever
factory worker         1975     4350        flu
technical supporter    1940     4350        fever
TABLE VI
Temp Table
Job                    Birth    Postcode    Disease    ClassID
Clerk                  1975     4350        HIV        1
manager                1955     4350        flu        1
clerk                  1955     5432        flu        2
factory worker         1955     5432        fever      2
factory worker         1975     4350        flu        3
technical supporter    1940     4350        fever      3
TABLE VII (a)
NSS Table
Job                    Birth    Postcode    ClassID
Clerk                  1975     4350        1
manager                1955     4350        1
Clerk                  1955     5432        2
factory worker         1955     5432        2
factory worker         1975     4350        3
technical supporter    1940     4350        3
TABLE VII (b)
SS Table
ClassID    Disease
1          HIV
1          Flu
2          Flu
2          Fever
3          Flu
3          Fever
TABLE VIII
Joining Tables VII (a) and VII (b)
Job                    Birth    Postcode    Disease    ClassID
clerk                  1975     4350        HIV        1
manager                1955     4350        HIV        1
clerk                  1975     4350        flu        1
manager                1955     4350        flu        1
clerk                  1955     5432        flu        2
factory worker         1955     5432        flu        2
clerk                  1955     5432        fever      2
factory worker         1955     5432        fever      2
factory worker         1975     4350        flu        3
technical supporter    1940     4350        flu        3
factory worker         1975     4350        fever      3
technical supporter    1940     4350        fever      3
• From the lossy join, each individual is linked to at least 2 values of the sensitive attribute. Therefore, the required privacy of the individual can be guaranteed.
• In the joined table, for each individual there are at least 2 individuals linked to the same bag "B" of sensitive attribute values, so that they are not distinguishable in terms of the sensitive values.
• The first record in the raw table (QID = (clerk, 1975, 4350)) is linked to the bag {HIV, flu}.
• The second individual (QID = (manager, 1955, 4350)) is linked to the same bag "B" of sensitive attribute values.
• This is the goal of k-anonymity for the protection of sensitive attribute values.
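The construction of Tables VI to VIII can be reproduced with plain Python lists. This is a minimal sketch using the values of Table V (job and disease names are written in lower case for consistency); the variable names and the helper `bag_for` are hypothetical.

```python
# Raw tuples of Table V: (job, birth, postcode, disease).
RAW = [
    ("clerk", 1975, 4350, "HIV"),
    ("manager", 1955, 4350, "flu"),
    ("clerk", 1955, 5432, "flu"),
    ("factory worker", 1955, 5432, "fever"),
    ("factory worker", 1975, 4350, "flu"),
    ("technical supporter", 1940, 4350, "fever"),
]
CLASS_IDS = [1, 1, 2, 2, 3, 3]  # equivalence classes as in Table VI

# Table VII (a): non-sensitive attributes plus ClassID.
nss = [(job, birth, post, cid)
       for (job, birth, post, _), cid in zip(RAW, CLASS_IDS)]
# Table VII (b): ClassID plus the sensitive attribute.
ss = [(cid, disease) for (_, _, _, disease), cid in zip(RAW, CLASS_IDS)]

# Table VIII: join on ClassID. Every tuple is paired with every disease
# of its class, so the original pairing cannot be recovered.
joined = [
    (job, birth, post, disease, cid)
    for job, birth, post, cid in nss
    for ss_cid, disease in ss
    if cid == ss_cid
]


def bag_for(qid):
    """Sensitive values linked to one quasi-identifier after the join."""
    return sorted(d for job, birth, post, d, _ in joined
                  if (job, birth, post) == qid)


print(len(joined))
print(bag_for(("clerk", 1975, 4350)))
```

The join has 12 rows instead of the 6 rows of the Temp table, and the clerk and the manager of class 1 are linked to the same bag of two diseases.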
2) Applying Lossy Join Approach with (α, β, k)-
Anonymity Model
The previous proposed model in [1] adopts the lossy join technique to solve the problem described in sub-section III.A. The author assigns a distinct number to each salary value, shown in the "Connecting Numbers" column of Table IX, and then uses these numbers to build Table X and Table XI. Joining Tables X and XI on these connecting numbers produces Table XII.
To illustrate this anonymity approach, we analyze the data in Table XII, which satisfies 7-anonymity with respect to "Sex", "Age" and "Zip code" and includes two QI-groups [1], as follows:
• The first group has five different diseases and three different salaries.
• The second group also has five different diseases and three different salaries; therefore β is at least 3.
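The join that produces Table XII can be sketched as follows. The paper does not state whether the join is taken per tuple or per distinct QI combination; joining the distinct (QI, connecting number) pairs is an assumption made here because it yields the 14 rows shown in Table XII. All variable names are hypothetical.

```python
TABLE_X = [  # (id, sex, age, zip code, connecting number)
    ("T1", "F", "[25-30]", "66***", 1), ("T2", "F", "[25-30]", "66***", 2),
    ("T3", "F", "[25-30]", "66***", 3), ("T4", "F", "[25-30]", "66***", 1),
    ("T5", "M", "[35-40]", "6****", 4), ("T6", "M", "[35-40]", "6****", 3),
    ("T7", "M", "[35-40]", "6****", 1), ("T8", "M", "[35-40]", "6****", 3),
]
TABLE_XI = [  # (connecting number, salary, disease)
    (1, 6000, "Headache"), (2, 4000, "Depression"), (3, 2000, "Depression"),
    (1, 6000, "Paranoia"), (4, 5000, "Catatonia"), (3, 2000, "Paranoia"),
    (1, 6000, "Catatonia"), (3, 2000, "Insomnia"),
]

# Distinct (QI, connecting number) pairs; tuple IDs are dropped (assumption).
qi_links = {(sex, age, zip_code, cn) for _, sex, age, zip_code, cn in TABLE_X}

joined = sorted(
    (sex, age, zip_code, cn, salary, disease)
    for sex, age, zip_code, cn in qi_links
    for xi_cn, salary, disease in TABLE_XI
    if cn == xi_cn
)

# Size of each QI-group and the distinct diseases it is linked to.
group_sizes = {}
for sex, *_ in joined:
    group_sizes[sex] = group_sizes.get(sex, 0) + 1
diseases_per_group = {
    sex: {row[5] for row in joined if row[0] == sex} for sex in group_sizes
}

print(len(joined), group_sizes)
```

Each QI-group ends up with 7 rows and 5 distinct diseases, which matches the 7-anonymity and the disease counts stated for Table XII.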
TABLE IX
Anonymized Table with Connecting Numbers
      QI                            SA
ID    Sex    Age        Zip code    Salary(S1)    Disease(S2)    Connecting Numbers
T1    F      [25-30]    66***       6000          Headache       1 (for 6000)
T2    F      [25-30]    66***       4000          Depression     2 (for 4000)
T3    F      [25-30]    66***       2000          Depression     3 (for 2000)
T4    F      [25-30]    66***       6000          Paranoia       1 (for 6000)
T5    M      [35-40]    6****       5000          Catatonia      4 (for 5000)
T6    M      [35-40]    6****       2000          Paranoia       3 (for 2000)
T7    M      [35-40]    6****       6000          Catatonia      1 (for 6000)
T8    M      [35-40]    6****       2000          Insomnia       3 (for 2000)
TABLE X
QI-Tuples with Connecting Numbers
      QI
ID    Sex    Age        Zip code    Connecting Numbers
T1    F      [25-30]    66***       1
T2    F      [25-30]    66***       2
T3    F      [25-30]    66***       3
T4    F      [25-30]    66***       1
T5    M      [35-40]    6****       4
T6    M      [35-40]    6****       3
T7    M      [35-40]    6****       1
T8    M      [35-40]    6****       3
TABLE XI
Sensitive Attributes with Connecting Numbers
                      SA
Connecting Numbers    Salary(S1)    Disease(S2)
1                     6000          Headache
2                     4000          Depression
3                     2000          Depression
1                     6000          Paranoia
4                     5000          Catatonia
3                     2000          Paranoia
1                     6000          Catatonia
3                     2000          Insomnia
• In the first group, $n_{S_1} = 3$, $n_{S_2} = 5$ and $n'_{S_2} = 3$, because the distinct Disease values that share the same Salary value 6000 are "Headache", "Paranoia" and "Catatonia", and
TABLE XII
Tuples with Sensitive Attributes Using Connecting Numbers
       QI                                                  SA
ID     Sex    Age        Zip code    Connecting Numbers    Salary(S1)    Disease(S2)
T1     F      [25-30]    66***       1                     6000          Headache
T2     F      [25-30]    66***       1                     6000          Paranoia
T3     F      [25-30]    66***       1                     6000          Catatonia
T4     F      [25-30]    66***       2                     4000          Depression
T5     F      [25-30]    66***       3                     2000          Paranoia
T6     F      [25-30]    66***       3                     2000          Depression
T7     F      [25-30]    66***       3                     2000          Insomnia
T8     M      [35-40]    6****       3                     2000          Paranoia
T9     M      [35-40]    6****       3                     2000          Depression
T10    M      [35-40]    6****       3                     2000          Insomnia
T11    M      [35-40]    6****       4                     5000          Catatonia
T12    M      [35-40]    6****       1                     6000          Headache
T13    M      [35-40]    6****       1                     6000          Paranoia
T14    M      [35-40]    6****       1                     6000          Catatonia
o The distinct Disease values that share the same Salary value 2000 are "Depression", "Paranoia" and "Insomnia".
o Thus $\alpha = n_{S_2} - n'_{S_2} = 5 - 3 = 2$, so the group satisfies (α, β, k)-anonymity.
• In the second group, $n_{S_1} = 3$, $n_{S_2} = 5$ and $n'_{S_2} = 3$, because the distinct Disease values that share the same Salary value 6000 are "Headache", "Paranoia" and "Catatonia", and
o The distinct Disease values that share the same Salary value 2000 are "Depression", "Paranoia" and "Insomnia".
o Thus $\alpha = n_{S_2} - n'_{S_2} = 5 - 3 = 2$, so the group satisfies (α, β, k)-anonymity.
Figure I shows the (α, β, k) test architecture and Figure II shows the architecture of the previous proposed model.
IV. PRIVACY PRESERVING USING ANATOMY TECHNIQUE
The anatomy technique releases two different tables, a Quasi-Identifier (QI) attributes table and a Sensitive Table (ST) for the Sensitive Attributes (SA), instead of publishing a single table with generalized values [20, 21]. There is no need to modify the original table, because anatomy releases all QI values and the ST directly in two separate tables that meet the L-diversity privacy requirement [20]. The anatomy technique was proposed to overcome the disadvantages of generalization, which often loses considerable information in the microdata. Anatomy captures the exact QI-distribution and releases two tables, a quasi-identifier table (QIT) and a sensitive table (ST), which separate QI-values from sensitive attribute values. For example, Tables XIV (a) and XIV (b) show the QIT and ST obtained from the microdata in Table XIII, respectively [20]. The methodology can be explained as follows:
• First, the records of the microdata are partitioned into different QI-groups, based on a certain strategy. Following the grouping in Table XIII, records "1" to "4" form QI-group number "1" and records "5" to "8" form QI-group number "2".
• Second, the quasi-identifier table (QIT) is created. Specifically, for each record in Table XIII, the QIT (Table XIV (a)) includes all its exact QI-values, together with its group membership in a new column Group-ID. However, the QIT does not contain any disease value.
• Finally, the ST (Table XIV (b)) maintains the disease statistics of each QI-group.
The QIT does not indicate the sensitive value of any record, which must be guessed at random from the ST, so anatomy preserves privacy. To explain this, consider an adversary who knows the age "25" and Zip code "11500" of "Ali". From the QIT (Table XIV (a)), the adversary knows that record "1" belongs to "Ali", but does not obtain any information about his disease so far. Instead, s/he gets the id "1" of the QI-group containing record "1". Judging from the ST (Table XIV (b)), the adversary realizes that, among the "4" records in QI-group "1", 50% are associated with "pneumonia" and 50% with "dyspepsia" in the microdata. Note that s/he does not gain any additional information regarding the exact diseases carried by these records. Hence, s/he can only conclude that "Ali" has contracted "pneumonia" (or "dyspepsia") with 50% probability.
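The three anatomy steps and the adversary's estimate can be sketched as follows. This is an illustration built on the values of Table XIII (disease names are written in lower case for consistency); the fixed grouping rule and the helper names `group_of` and `disease_probability` are assumptions made for the example, not part of the technique in [20].

```python
from collections import Counter

MICRODATA = [  # (row, age, sex, zipcode, disease) from Table XIII
    (1, 25, "M", "11500", "pneumonia"), (2, 29, "M", "13200", "dyspepsia"),
    (3, 33, "M", "59300", "dyspepsia"), (4, 55, "M", "12700", "pneumonia"),
    (5, 60, "F", "54600", "flu"), (6, 59, "F", "25200", "gastritis"),
    (7, 60, "F", "25100", "flu"), (8, 58, "F", "31000", "bronchitis"),
]


def group_of(row_number):
    """Grouping of the example: rows 1-4 form group 1, rows 5-8 group 2."""
    return 1 if row_number <= 4 else 2


# Quasi-identifier table: exact QI values plus Group-ID, no disease.
qit = [(row, age, sex, zc, group_of(row)) for row, age, sex, zc, _ in MICRODATA]
# Sensitive table: count of each disease per group.
st = Counter((group_of(row), disease) for row, *_, disease in MICRODATA)


def disease_probability(age, zipcode, disease):
    """Adversary's estimate for a person located in the QIT by age and zip."""
    group = next(g for _, a, _, z, g in qit if a == age and z == zipcode)
    group_total = sum(c for (g, _), c in st.items() if g == group)
    return st[(group, disease)] / group_total


print(disease_probability(25, "11500", "pneumonia"))
```

For "Ali" (age 25, Zip code 11500) the estimate for "pneumonia" is 2 out of 4 records, matching the 50% stated in the text.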
FIGURE I
(α, β, k) Test Architecture
FIGURE II
Previous Proposed Technique Architecture
V. IMPLEMENTATION OF THE ENHANCED PROPOSED MODEL
The authors introduce the present problem definition with an example that explains it and shows how the enhanced proposed model solves this problem.
A. Present Problem Definition
The previous proposed model in [1] assumes that the researcher takes all sensitive attributes in the same tuple of the sensitive table (ST) as a set. If the researcher splits this tuple into separate sensitive attribute values, he may face a problem, especially if he needs to know the frequency of each separate sensitive attribute (except the attribute that is used as the basis for the connecting numbers). The authors noticed this problem when applying the previous proposed model, as explained in the following example:
Taking the two published Tables X and XI mentioned before, the authors noticed that a researcher who wants to know the exact number of people who have the same sensitive attribute value cannot reach the correct number, as explained in the next two cases:
• Case I: When the researcher tries to calculate the total number of people who have the same salary, he can get the frequency from Table XI by counting how often each connecting number appears in that table, as shown in Table XV (a). From Table XV (a) we find, for example, that the salary 6000 has frequency 3, which is exactly its frequency in the original Table IX (tuples "T1", "T4" and "T7"). The same holds for all other salary values, which give the same frequencies as the original Table IX. The frequency is easy to retrieve because the salary is used as the basis for the connecting numbers between the two published tables.
TABLE XIV (a)
The Quasi-identifier Table (QIT)
Row Number    Age    Sex    Zipcode    Group-ID
1 (Ali)       25     M      11500      1
2             29     M      13200      1
3             33     M      59300      1
4             55     M      12700      1
5             60     F      54600      2
6             59     F      25200      2
7 (Hoda)      60     F      25100      2
8             58     F      31000      2
TABLE XIV (b)
The Sensitive Table (ST)
Group-ID    Disease       Count
1           Dyspepsia     2
1           Pneumonia     2
2           Bronchitis    1
2           Flu           2
2           Gastritis     1
• Case II: When the researcher tries to calculate the total number of people who have the same disease (for example, "Depression"), he can consult Table XII to find that "Depression" has connecting numbers "2" and "3". Returning to Table X and writing "Depression" next to every tuple with connecting number "2" or "3" yields Table XV (b). From Table XV (b), the researcher concludes that "4" people suffer from "Depression" (the rows marked "Depression" in Table XV (b)). This number differs from the number in the original Table IX, where only "2" tuples ("T2" and "T3") have "Depression", which consequently harms the accuracy of research results.
From the above, it is clear that there is no problem with the frequency of the sensitive attribute used as the basis for the connecting numbers (Salary); the problem arises when we try to figure out the frequency of the other sensitive attribute (Disease).
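The mismatch in Case II can be reproduced with a short script. This is a sketch over the values of Tables IX to XI; the helper name `count_via_connecting_numbers` is hypothetical and simply mimics what a researcher would do with the two published tables.

```python
TABLE_X = [  # (id, connecting number); QI columns omitted for brevity
    ("T1", 1), ("T2", 2), ("T3", 3), ("T4", 1),
    ("T5", 4), ("T6", 3), ("T7", 1), ("T8", 3),
]
TABLE_XI = [  # (connecting number, salary, disease)
    (1, 6000, "Headache"), (2, 4000, "Depression"), (3, 2000, "Depression"),
    (1, 6000, "Paranoia"), (4, 5000, "Catatonia"), (3, 2000, "Paranoia"),
    (1, 6000, "Catatonia"), (3, 2000, "Insomnia"),
]
# Disease column of the original Table IX, in tuple order T1..T8.
ORIGINAL_DISEASES = ["Headache", "Depression", "Depression", "Paranoia",
                     "Catatonia", "Paranoia", "Catatonia", "Insomnia"]


def count_via_connecting_numbers(disease):
    """Count QI-tuples whose connecting number is linked to the disease."""
    linked = {cn for cn, _, d in TABLE_XI if d == disease}
    return sum(1 for _, cn in TABLE_X if cn in linked)


inferred = count_via_connecting_numbers("Depression")
actual = ORIGINAL_DISEASES.count("Depression")
print(inferred, actual)
```

The count obtained through the connecting numbers is 4, while the original table contains only 2 tuples with "Depression".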
TABLE XV (a)
Frequency of Each Salary Set According to Connecting Numbers in Table IX
                     SA
Connecting Number    Salary(S1)    Salary Set Frequency
1                    6000          3
2                    4000          1
3                    2000          3
4                    5000          1
TABLE XV (b)
People who are Sick with Depression Disease According to Connecting Numbers
      QI
ID    Sex    Age        Zip code    Connecting Numbers    Disease
T1    F      [25-30]    66***       1
T2    F      [25-30]    66***       2                     Depression
T3    F      [25-30]    66***       3                     Depression
T4    F      [25-30]    66***       1
T5    M      [35-40]    6****       4
T6    M      [35-40]    6****       3                     Depression
T7    M      [35-40]    6****       1
T8    M      [35-40]    6****       3                     Depression
TABLE XIII
The Microdata
Tuple ID    Age    Sex    Zipcode    Disease
1 (Ali)     25     M      11500      Pneumonia
2           29     M      13200      dyspepsia
3           33     M      59300      dyspepsia
4           55     M      12700      pneumonia
5           60     F      54600      Flu
6           59     F      25200      gastritis
7 (Hoda)    60     F      25100      Flu
8           58     F      31000      bronchitis
B. The Enhanced Proposed Model
The authors solve the problem explained in sub-section V.A by adding a frequency details column (similar to the Count column used in the anatomy ST, Table XIV (b)). For each of the remaining sensitive attributes, that is, every sensitive attribute except the one used as the basis for the connecting numbers, this column gives the exact frequency of each value as in the original table.
The frequency details column serves only as a guide for researchers, informing them of the frequency of the sensitive attribute values (except for the attribute used as the basis for the connecting numbers), which improves the accuracy of research results.
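One way to derive such a column is sketched below. The layout follows the published example, where the total frequency of a disease is written next to its first occurrence and later occurrences carry "-"; that convention and the function name `add_frequency_details` are assumptions read off the example table, not a specification from the paper.

```python
from collections import Counter

TABLE_XI = [  # (connecting number, salary, disease), one row per original tuple
    (1, 6000, "Headache"), (2, 4000, "Depression"), (3, 2000, "Depression"),
    (1, 6000, "Paranoia"), (4, 5000, "Catatonia"), (3, 2000, "Paranoia"),
    (1, 6000, "Catatonia"), (3, 2000, "Insomnia"),
]


def add_frequency_details(rows):
    """Append the disease frequency to the first row of each disease."""
    totals = Counter(disease for _, _, disease in rows)
    seen = set()
    detailed = []
    for cn, salary, disease in rows:
        # Later occurrences of the same disease get "-" (assumed convention).
        detail = "-" if disease in seen else totals[disease]
        seen.add(disease)
        detailed.append((cn, salary, disease, detail))
    return detailed


table_with_details = add_frequency_details(TABLE_XI)
for row in table_with_details:
    print(row)
```

Because the sensitive table holds one row per original tuple, counting diseases in it gives the original frequencies, for example 2 for "Depression" instead of the 4 obtained through the connecting numbers.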
C. Applying the Proposed Solution
According to the enhanced proposed model, the solution can be implemented as in the next two tables (XVI and XVII):
• First, Table XVI contains the QI-tuples with connecting numbers. It is the same as the published table (Table X) in [1], without any changes.
• Second, Table XVII contains the sensitive attributes with frequency details. In this table, the frequency details give only the frequency of the sensitive attributes other than the one used as the basis for the connecting numbers, regardless of the connecting numbers or the salary they are linked to. This table differs from the sensitive attributes table (Table XI) in [1] by the added frequency details column, which helps researchers figure out the exact frequency of all sensitive attributes. The architecture of the proposed model is presented in Figure III.
TABLE XVI
QI-Tuples with Connecting Numbers
      QI
ID    Sex    Age        Zip code    Connecting Numbers
T1    F      [25-30]    66***       1
T2    F      [25-30]    66***       2
T3    F      [25-30]    66***       3
T4    F      [25-30]    66***       1
T5    M      [35-40]    6****       4
T6    M      [35-40]    6****       3
T7    M      [35-40]    6****       1
T8    M      [35-40]    6****       3
TABLE XVII
Sensitive Data with Connecting Numbers & Frequency Details
                      SA
Connecting Numbers    Salary(S1)    Disease(S2)    Frequency Details for Disease(S2)
1                     6000          Headache       1
2                     4000          Depression     2
3                     2000          Depression     -
1                     6000          Paranoia       2
4                     5000          Catatonia      2
3                     2000          Paranoia       -
1                     6000          Catatonia      -
3                     2000          Insomnia       1
FIGURE III
Proposed Technique Architecture
VI. CONCLUSION AND FUTURE WORK
This paper proposes a solution to a problem that may occur in our previous proposed (α, β, k)-anonymity model in [1]. Although the previous model protects the privacy of multiple sensitive attributes and helps anonymized data resist background knowledge attacks effectively, one problem may occur. It arises if a researcher tries to figure out the exact frequency of the remaining sensitive attributes (those not used as the basis for the connecting numbers) and does not consider all sensitive attributes in the same tuple together as a set. In other words, the frequency derived for any one of the remaining sensitive attributes differs from the frequency of the same attribute in the original table. The authors solve this problem by adding frequency details to the sensitive attributes table. With the frequency details, the authors address the data utility problem and make the model more efficient for both data privacy and data utility. Frequency details improve research accuracy and help the researcher answer some important questions, especially those that involve the frequency of any sensitive attribute in the original data table. In future work, the authors intend to solve the same problem using a hash function technique.
REFERENCES
[1] Abou_el_ela Abdou Hussien, "An Effective Privacy Preserving
Model for Databases Using (α, β, k) - Anatomy Model and Lossy
Join", International Journal of Computer Networking, Wireless and
Mobile Communications, Vol. 3, Issue 1, pp.389-400, Mar, 2013.
[2] Mohammed J. Zaki, Limsoon Wong, "Data Mining Techniques",
SPC/Lecture Notes Series: zaki-chap, August 9, 2003.
[3] Xingquan Zhu, Ian Davidson, "Knowledge Discovery and Data
Mining: Challenges and Realities", ISBN, Hershey, New York,
2007.
[4] Joseph, Zernik, "Data Mining as a Civic Duty Online Public
Prisoners Registration Systems", International Journal on Social
Media: Monitoring, Measurement, Mining, Vol.No.1, pp. 84-96,
September, 2010.
[5] Kaidi Zhao, Bing Liu, Thomas M. Tirpak, Weimin Xiao,
"A Visual Data Mining Framework for Convenient
Identification of Useful Knowledge", ICDM'05 Proceedings of the
Fifth IEEE International Conference on Data Mining, Vol.No-1,
pp. 530-537, December, 2005.
[6] Venkatadri.M and Lokanatha C. Reddy, "A Comparative Study on
Decision Tree Classification Algorithm in Data Mining",
International Journal of Computer Applications in Engineering,
Technology and Sciences (IJCAETS), Vol.No. 2, pp. 24- 29, Sept,
2010.
[7] Sara Hajian, "Simultaneous Discrimination Prevention and Privacy
Protection in Data Publishing and Mining", A Dissertation
Submitted to the Department of Computer Engineering and
Mathematics of Universitat Rovira i Virgili, 28 Jun, 2013.
[8] Jagriti Singh, S. S. Sane, "Discrimination Discovery and Prevention
in Data Mining", International Journal of Engineering Sciences &
Research Technology, Vol.No.3, June, 2014.
[9] Abou_el_ela Abdou Hussien, Nermin Hamza, Hesham A. Hefny,
"Attacks on Anonymization-Based Privacy-Preserving: A Survey
for Data Mining and Data Publishing", Journal of Information
Security (JIS), Vol.No. 4, pp.101-112, April, 2013.
[10] P. Samarati and L. Sweeney, "Protecting Privacy When Disclosing
Information: k-Anonymity and Its Enforcement through
Generalization and Suppression", Technical Report SRI-CSL-98-
04, 1998.
[11] Ke Wang, Benjamin C. M. Fung, "Anonymizing Sequential
Releases", KDD’06, Philadelphia, Pennsylvania, USA, August 20–
23, 2006.
[12] Nidhi Maheshwarkar, Kshitij Pathak, Vivekananda Chourey,
"Performance Issues of Various K-anonymity Strategies",
International Journal of Computer Technology and Electronics
Engineering (IJCTEE), ISSN, 2011.
[13] Pierangela Samarati, Latanya Sweeney, "Protecting Privacy when
Disclosing Information: K-Anonymity and its enforcement through
Generalization and Suppression", Special Issue of International
Journal of Computer Applications on Optimization and On-chip
Communication, Vol.No.10, Feb, 2012.
[14] Nidhi Maheshwarkar, Kshitij Pathak, Narendra S. Choudhari,
"K-anonymity Model for Multiple
Sensitive Attributes", Special Issue of International Journal of
Computer Applications on Optimization and On-chip
Communication, Vol.No.10, Feb, 2012.
[15] Nagendra kumar.S, Aparna.R, "Sensitive Attributes based Privacy
Preserving in Data Mining using k-anonymity", International
Journal of Computer Applications, December, 2013.
[16] Abou_el_ela Abdo Hussein, Nagy Ramadan Darwish, Hesham A.
Hefny, "Multiple-Published Tables Privacy-Preserving Data
Mining: A Survey for Multiple-Published Tables Techniques",
(IJACSA) International Journal of Advanced Computer Science
and Applications, Vol.No. 6, 2015.
[17] A. Machanavajjhala, J. Gehrke, D. Kifer, and M.
Venkitasubramaniam, "L-diversity: Privacy beyond k-anonymity",
In Proc. 22nd Conf. Data Engg. (ICDE), pp. 24, 2006.
[18] Yan Zhao, Jian Wang, Yongcheng Luo, Jiajin Le,
"(α, β, k)-anonymity: An effective Privacy Preserving Model for Databases",
International Conference on Test and Measurement, 2009.
[19] Raymond Chi-Wing Wong, Yubao Liu, Jian Yin, Zhilan
Huang, Ada Wai-Chee Fu, and Jian Pei, "(α, k)-anonymity Based
Privacy Preservation by Lossy join", Lecture Notes in Computer
Science, pp.733-744, 2007.
[20] X. Xiao and Y. Tao, "Anatomy: Simple and effective privacy
preservation", In VLDB, 2006.
[21] Xianmang He, Yanghua Xiao, Yujia Li, Qing Wang, Wei Wang,
Baile Shi, "Permutation Anonymization: Improving Anatomy for
Privacy Preservation in Data Publication", the series Lecture Notes
in Computer Science, Vol.No.7104, pp.111-123, 2012.