1045-9219 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TPDS.2015.2425407, IEEE Transactions on Parallel and Distributed Systems
An Efficient Privacy-Preserving Ranked
Keyword Search Method
Chi Chen, Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student
Member, IEEE, J. Hu, Member, IEEE, S. Guo, Senior Member, IEEE, Z. Tari, Senior Member, IEEE,
and Albert Y. Zomaya, Fellow, IEEE
Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation.
Therefore it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship
between documents is normally concealed in the process of encryption, which leads to significant degradation of search
accuracy. Also, the volume of data in data centers has experienced dramatic growth. This makes it even more challenging to
design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of
encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the
demand for fast ciphertext search in a big data environment. The proposed hierarchical approach clusters the documents based
on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the
maximum cluster size is reached. In the search phase, this approach achieves a linear computational complexity against an
exponential increase in the size of the document collection. In order to verify the authenticity of search results, a structure called
the minimum hash sub-tree is designed in this paper. Experiments have been conducted using a collection set built from
IEEE Xplore. The results show that with a sharp increase of documents in the dataset, the search time of the proposed method
increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method
has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.
Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, big data,
security
1 INTRODUCTION
As we step into the big data era, terabytes of data are produced worldwide every day. Enterprises and users who own a large amount of data usually choose to outsource their precious data to a cloud facility in order to reduce data management costs and storage spending. As a result, the data volume in cloud storage facilities is experiencing a dramatic increase. Although cloud service providers (CSPs) claim that their cloud services are armed with strong security measures, security and privacy remain major obstacles preventing the wider acceptance of cloud computing services [1].

An early version of this paper was presented at the Workshop on BigSecurity with IEEE INFOCOM 2014 [28]. Extensive enhancements have been made, including a novel verification scheme that helps the data user verify the authenticity of the search results, a security analysis, and more details of the proposed scheme. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010701) and the National High Technology Research and Development Program of China (No. 2013AA01A24).

Chi Chen is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: chenchi@iie.ac.cn).
Xiaojie Zhu is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: zhuxiaojie@iie.ac.cn).
Peisong Shen is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: shenpeisong@iie.ac.cn).
J. Hu is with the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, ACT 2600, Australia (e-mail: J.Hu@adfa.edu.au).
Song Guo is with the School of Computer Science and Engineering, The University of Aizu, Japan (e-mail: sguo@u-aizu.ac.jp).
Zahir Tari is with the School of Computer Science, RMIT University, Australia (e-mail: zahir.tari@rmit.edu.au).
Albert Zomaya is with the School of Information Technologies, The University of Sydney, Australia (e-mail: albert.zomaya@sydney.edu.au).
A traditional way to reduce information leakage is data encryption. However, this makes server-side data utilization, such as searching on encrypted data, a very challenging task. In recent years, researchers have proposed many ciphertext search schemes [35-38][43] that incorporate cryptographic techniques. These methods come with provable security, but they require massive computation and have high time complexity. Therefore, they are not suitable for the big data scenario, where the data volume is huge and applications require online data processing. In addition, the relationship between documents is concealed in the above methods. The relationship between documents represents their properties, and hence maintaining it is vital to fully express a document. For example, the relationship can be used to express a document's category. If a document is independent of all other documents except those related to sports, then it is easy to assert that this document belongs to the sports category. Due
to the blind encryption, this important property has been concealed in traditional methods. Therefore, it is desirable to have a method that can maintain and utilize this relationship to speed up the search phase.
On the other hand, due to software/hardware failures and storage corruption, the search results returned to users may contain damaged data or may have been distorted by a malicious administrator or intruder. Thus, a verifiable mechanism should be provided for users to verify the correctness and completeness of the search results.
In this paper, a vector space model is used and every document is represented by a vector, which means every document can be seen as a point in a high-dimensional space. Due to the relationships between different documents, all the documents can be divided into several categories. In other words, points that are close to each other in the high-dimensional space can be classified into a specific category. The search time can be largely reduced by selecting the desired category and abandoning the irrelevant ones. Compared with the whole dataset, the number of documents that a user aims at is very small. Because the number of desired documents is small, a specific category can be further divided into several sub-categories. Instead of using the traditional sequential search method, a backtracking algorithm is proposed to search for the target documents. The cloud server first searches the categories and gets the minimum desired sub-category. Then the cloud server selects the desired k documents from the minimum desired sub-category. The value of k is decided by the user in advance and sent to the cloud server. If the current sub-category cannot supply k documents, the cloud server traces back to its parent and selects the desired documents from the sibling categories. This process is executed recursively until the desired k documents are obtained or the root is reached. To verify the integrity of the search result, a verifiable structure based on hash functions is constructed. Every document is hashed, and the hash result is used to represent the document. The hash results of the documents are hashed again together with the information of the category that these documents belong to, and the result is used to represent the current category. Similarly, every category is represented by the hash of the combination of the current category information and its sub-categories' information. A virtual root is constructed to represent all the data and categories; it is denoted by the hash of the concatenation of all the categories located in the first level. The virtual root is signed so that it is verifiable. To verify the search result, the user only needs to verify the virtual root instead of verifying every document.
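The hash-then-sign structure described above can be sketched as follows. This is a hedged illustration only: the grouping, labels, and `category_hash` helper are illustrative and not the paper's exact construction.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest used as a node's representative."""
    return hashlib.sha256(data).digest()

def category_hash(child_hashes, category_info: bytes) -> bytes:
    # A category is the hash of its children's concatenated hashes
    # combined with its own category information.
    return h(b"".join(child_hashes) + category_info)

# Leaf level: every document is hashed individually.
docs = [b"doc about football", b"doc about phones", b"doc about clouds"]
doc_hashes = [h(d) for d in docs]

# First-level categories (illustrative grouping).
cat_sports = category_hash(doc_hashes[:1], b"category:sports")
cat_tech = category_hash(doc_hashes[1:], b"category:tech")

# The virtual root hashes the concatenation of all first-level
# categories; signing only this value lets the user verify any
# returned document against a single signature.
virtual_root = h(cat_sports + cat_tech)
```

To check a returned document, the user recomputes the document's hash and the hashes along its path, then compares the recomputed virtual root with the signed one; any tampering changes the root.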
2 EXISTING SOLUTIONS
In recent years, searchable encryption, which provides a text search function over encrypted data, has been widely studied, especially regarding security definitions, formalizations, and efficiency improvements, e.g. [2-7]. As shown in Fig. 1, the proposed method is compared with existing solutions and has an advantage in maintaining the relationship between documents.
2.1 Single Keyword Searchable Encryption
Song et al. [8] first introduced the notion of searchable encryption. They proposed to encrypt each word in the document independently. This method has a high search cost due to scanning the whole data collection word by word. Goh et al. [9] formally defined a secure index structure and formulated a security model for indexes known as semantic security against adaptive chosen keyword attack (IND-CKA). They also developed an efficient IND-CKA secure index construction called Z-IDX using pseudo-random functions and Bloom filters. Cash et al. [42] recently designed and implemented an efficient data structure. Due to the lack of a ranking mechanism, users have to take a long time to select what they want when massive documents contain the query keyword. Thus, order-preserving techniques are utilized to realize the ranking mechanism, e.g. [10-12]. Wang et al. [13] use an encrypted inverted index to achieve secure ranked keyword search over encrypted documents. In the search phase, the cloud server computes the relevance score between the documents and the query. In this way, relevant documents are ranked according to their relevance scores and users can get the top-k results. In the public key setting, Boneh et al. [3] designed the first searchable encryption construction, where anyone can use the public key to write to the data stored on the server but only authorized users owning the private key can search. However, all the above-mentioned techniques only support single keyword search.
2.2 Multiple Keyword Searchable Encryption
To enrich search predicates, a variety of conjunctive keyword search methods (e.g. [7, 14-17]) have been proposed. These methods introduce large overhead, such as communication cost from secret sharing, e.g. [15], or computational cost from bilinear maps, e.g. [7]. Pang et al. [18] propose a secure search scheme based on the vector space model. Due to the lack of a security analysis for frequency information and of practical search performance, it is unclear whether their scheme is secure and efficient. Cao et al. [19] present a novel architecture to solve the problem of multi-keyword ranked search over encrypted cloud data. However, the search time of this method grows exponentially with the exponentially increasing size of the document collection. Sun et al. [20] give a new
architecture which achieves better search efficiency. However, at the index building stage, the relevance between documents is ignored. As a result, since the relevance of plaintexts is concealed by the encryption, users' expectations cannot be fulfilled well. For example, given a query containing "Mobile" and "Phone", only the documents containing both keywords will be retrieved by traditional methods. But if the semantic relationship between the documents is taken into consideration, the documents containing "Cell" and "Phone" should also be retrieved. Obviously, the second result better meets the user's expectation.
2.3 Verifiable Search Based on Authenticated Index
The idea of data verification has been well studied in
the area of databases. In a plaintext database scenario,
a variety of methods have been produced, e.g. [21-23]. Most of these works are based on the original work by Merkle [24, 25] and refinements by Naor and Nissim [26] for certificate revocation. The Merkle hash tree and cryptographic signature techniques are used to construct an authenticated tree structure upon which end users can verify the correctness and completeness of the query results.
Pang et al. [27] apply the Merkle hash tree based authenticated structure to text search engines. However, they only focus on verification-specific issues, ignoring the search privacy preserving capabilities that are addressed in this paper.
A hash chain is used by Wang et al. [10] to construct a single keyword search result verification scheme. Sun et al. [20] use a Merkle hash tree and cryptographic signatures to create a verifiable MDB-tree. However, their work cannot be directly used in our architecture, which is oriented toward privacy-preserving multiple keyword search. Thus, a proper mechanism that can verify the search results in the big data scenario is essential to both CSPs and end users.
3 OUR CONTRIBUTION
In this paper, we propose a multi-keyword ranked search over encrypted data based on a hierarchical clustering index (MRSE-HCI) that maintains the close relationship between different plain documents over the encrypted domain in order to enhance search efficiency. In the proposed architecture, the search time grows linearly with an exponentially growing size of the data collection. We derive this idea from the observation that users' retrieval needs usually concentrate on a specific field, so we can speed up the search process by computing relevance scores only between the query and the documents that belong to the same specific field as the query. As a result, only documents which are classified into the field specified by the user's query will be evaluated to get their relevance scores. Because irrelevant fields are ignored, the search speed is enhanced.
We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve it. According to the proposed clustering method, every document is dynamically classified into a specific cluster which has a constraint on the minimum relevance score between the different documents it contains. The relevance score is a metric used to evaluate the relationship between different documents. When new documents are added to a cluster, the constraint on the cluster may be broken. If one of the new documents breaks the constraint, a new cluster center is added and the current document is chosen as a temporary cluster center. Then all the documents are reassigned and all the cluster centers are re-elected. Therefore, the number of clusters depends on the number of documents in the dataset and the close relationship between different plain documents. In other words, the cluster centers are created dynamically and the number of clusters is decided by the properties of the dataset.
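The dynamic clustering rule above can be sketched as follows. The helper names and the relevance metric (inner product of toy document vectors) are assumptions for illustration, not the paper's exact algorithm.

```python
# A document whose relevance to every existing center falls below the
# minimum-relevance constraint seeds a temporary new cluster center.
MIN_RELEVANCE = 0.5   # assumed constraint value

def relevance(u, v):
    return sum(a * b for a, b in zip(u, v))

def assign(docs, centers):
    """Assign each document vector to its most relevant center; a
    document that breaks the constraint becomes a new center."""
    clusters = {i: [] for i in range(len(centers))}
    for d in docs:
        scores = [relevance(d, c) for c in centers]
        best = max(range(len(centers)), key=lambda i: scores[i])
        if scores[best] < MIN_RELEVANCE:
            centers.append(d)                 # temporary new cluster center
            clusters[len(centers) - 1] = [d]
        else:
            clusters[best].append(d)
    return clusters

def recenter(clusters):
    """Re-elect each center as the average of its members' vectors."""
    return [[sum(x) / len(ms) for x in zip(*ms)]
            for ms in clusters.values() if ms]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = assign(docs, centers=[[1.0, 0.0]])
centers = recenter(clusters)   # two clusters emerge from one seed center
```

In the full scheme, assignment and re-election would be iterated until the clustering stabilizes; a single pass is shown here for brevity.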
We propose a hierarchical method in order to get a better clustering result within a large data collection. The size of each cluster is controlled as a trade-off between clustering accuracy and query efficiency. According to the proposed method, the number of clusters and the minimum relevance score increase with the level, whereas the maximum size of a cluster decreases. Depending on the needs of the grain level, the maximum size of a cluster is set at each level. Every cluster needs to satisfy these constraints. If a cluster's size exceeds the limitation, the cluster is divided into several sub-clusters.
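The level-by-level size constraint can be sketched as a recursive split. Here `split_into` is a naive stand-in for re-clustering an oversized cluster into sub-clusters, and the maximum size is a toy value.

```python
MAX_SIZE = 2   # assumed per-level maximum cluster size

def split_into(cluster, k):
    """Toy splitter: chop a cluster into k roughly equal sub-clusters."""
    step = (len(cluster) + k - 1) // k
    return [cluster[i:i + step] for i in range(0, len(cluster), step)]

def build_hierarchy(cluster, max_size=MAX_SIZE):
    """Recursively partition until every leaf satisfies the constraint."""
    if len(cluster) <= max_size:
        return cluster                     # small enough: a leaf cluster
    return [build_hierarchy(s, max_size) for s in split_into(cluster, 2)]

tree = build_hierarchy(["d1", "d2", "d3", "d4", "d5"])
# → [[["d1", "d2"], ["d3"]], ["d4", "d5"]]
```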
We design a search strategy to improve rank privacy. In the search phase, the cloud server first computes the relevance scores between the query and the cluster centers of the first level, and then chooses the nearest cluster. This process is iterated to get the nearest child cluster until the smallest cluster has been found. The cloud server then computes the relevance scores between the query and the documents included in the smallest cluster. If the smallest cluster cannot satisfy the number of desired documents, which is decided by the user in advance, the cloud server traces back to the parent of the smallest cluster and searches its sibling clusters. This process is iterated until the number of desired documents is satisfied or the root is reached. Due to this special search procedure, the rankings of documents in the search results differ from the rankings derived from a traditional sequential search. Therefore, rank privacy is enhanced.
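The descend-then-backtrack strategy can be sketched as below. The node layout (dicts with `"children"`, leaf lists of `(doc_id, vector)` pairs) and all helper names are illustrative assumptions, not the paper's data structures.

```python
def relevance(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf_docs(node):
    """All (doc_id, vector) pairs under a node."""
    if isinstance(node, list):
        return node
    out = []
    for c in node["children"]:
        out.extend(leaf_docs(c))
    return out

def center_of(node):
    """Average vector of a node's contents (a stand-in for stored centers)."""
    if isinstance(node, list):
        vecs = [v for _, v in node]
    else:
        vecs = [center_of(c) for c in node["children"]]
    return [sum(x) / len(vecs) for x in zip(*vecs)]

def search(node, query, k, path=None):
    path = [] if path is None else path
    if isinstance(node, list):               # reached the smallest cluster
        results = list(node)
        # Backtrack: absorb sibling subtrees until k documents are found.
        for parent, taken in reversed(path):
            if len(results) >= k:
                break
            for sib in parent["children"]:
                if sib is not taken:
                    results.extend(leaf_docs(sib))
        results.sort(key=lambda d: relevance(d[1], query), reverse=True)
        return results[:k]
    best = max(node["children"],
               key=lambda c: relevance(center_of(c), query))
    return search(best, query, k, path + [(node, best)])

# Toy two-level hierarchy; leaves hold (doc_id, vector) pairs.
leaf_a = [("a", [1.0, 0.0])]
leaf_b = [("b", [0.8, 0.2])]
leaf_c = [("c", [0.0, 1.0])]
root = {"children": [{"children": [leaf_a, leaf_b]}, leaf_c]}
top2 = search(root, query=[1.0, 0.0], k=2)
# leaf_a alone cannot supply k=2, so the sibling leaf_b is absorbed.
```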
Part of the above work was presented in [28]. As a further improvement, we also construct
a verifiable tree structure upon the hierarchical clustering method to verify the integrity of the search result in this paper. This authenticated tree structure mainly takes advantage of the Merkle hash tree and cryptographic signatures. Every document is hashed, and the hash result is used as the representative of the document. The smallest cluster is represented by the hash of the concatenation of the hashes of the documents it contains, combined with its own category information. A parent cluster is represented by the hash of the concatenation of its children's hashes, combined with its own category information. A virtual root is added and represented by the hash of the concatenation of the categories located in the first level. In addition, the virtual root is signed so that the user can verify the search result by verifying only the virtual root.
In short, our contributions can be summarized as
follows:
1) We investigate the problem of maintaining the
close relationship between different plain docu-
ments over an encrypted domain and propose a
clustering method to solve this problem.
2) We propose the MRSE-HCI architecture to speed up the server-side search phase. As the document collection grows exponentially, the search time grows only linearly instead of exponentially.
3) We design a search strategy to improve rank privacy. This search strategy adopts the backtracking algorithm upon the above clustering method. As the data volume grows, the advantage of the proposed method in rank privacy becomes more apparent.
4) By applying the Merkle hash tree and cryptographic signatures to an authenticated tree structure, we provide a verification mechanism to assure the correctness and completeness of the search results.
The rest of the paper is organized as follows: Section IV describes the system model, threat model, design goals, and notations. The architecture and detailed algorithms are presented in Section V. We discuss the efficiency and security of the MRSE-HCI scheme in Section VI. An evaluation method is provided in Section VII. Section VIII demonstrates the results of our experiments. Section IX concludes the paper.
4 DEFINITION AND BACKGROUND
4.1 System Model
The system model contains three entities, as illustrated in Fig. 1: the data owner, the data user, and the cloud server. The box with dashed lines in the figure indicates the component added to the existing architecture.
Fig. 1 Architecture of ciphertext search
The data owner is responsible for collecting documents, building the document index, and outsourcing them in an encrypted format to the cloud server. Apart from that, the data user needs to get authorization from the data owner before accessing the data. The cloud server provides a huge storage space and the computation resources needed by ciphertext search. Upon receiving a legal request from the data user, the cloud server searches the encrypted index and sends back the top-k documents that are most likely to match the user's query [12]. The number k is properly chosen by the data user. Our system aims at protecting data from leaking information to the cloud server while improving the efficiency of ciphertext search. In this model, both the data owner and the data user are trusted, while the cloud server is semi-trusted, which is consistent with the architecture in [10, 19, 29]. In other words, the cloud server will strictly follow the prescribed protocol but will try to get more information about the data and the index.
4.2 Threat Model
The adversary's ability can be summarized in two threat models.
Known Ciphertext Model
In this model, the cloud server can only get the encrypted document collection, the encrypted data index, and the encrypted query keywords.
Known Background Model
In this model, the cloud server knows more information than in the known ciphertext model. Statistical background information about the dataset, such as the document frequency and term frequency of a specific keyword, can be used by the cloud server to launch a statistical attack to infer or identify specific keywords in the query [10, 11], which further reveals the plaintext content of documents.
4.3 Design Goals
Search efficiency. The time complexity of the search in the MRSE-HCI scheme needs to be logarithmic in the size of the data collection in order to deal with the explosive growth of document volume in the big data scenario.
Retrieval accuracy. Retrieval precision is related to two factors: the relevance between the query and the documents in the result set, and the relevance among the documents in the result set.
Integrity of the search result. The integrity of the
search results includes three aspects:
1) Correctness. All the documents returned
from servers are originally uploaded by the
data owner and remain unmodified.
2) Completeness. No qualified documents are
omitted from the search results.
3) Freshness. The returned documents are the
latest version of documents in the dataset.
Privacy requirements. We set a series of privacy requirements on which current researchers mostly focus.
1) Data privacy. Data privacy denotes the confidentiality and privacy of documents. The adversary cannot get the plaintext of documents stored on the cloud server if data privacy is guaranteed. Symmetric cryptography is a conventional way to achieve data privacy.
2) Index privacy. Index privacy means the ability to frustrate the adversary's attempts to steal the information stored in the index. Such information includes keywords, the TF (term frequency) of keywords in documents, the topic of documents, and so on.
3) Keyword privacy. It is important to protect users' query keywords. A secure query generation algorithm should output trapdoors that leak no information about the query keywords.
4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from a query is different, even for the same query. This can be realized by integrating a random function into the trapdoor generation process. If the adversary can deduce that a certain set of trapdoors all correspond to the same keyword, he can calculate the frequency of this keyword in search requests over a certain period. Combined with the document frequency of the keyword in the known background model, he/she can use a statistical attack to identify the plain keyword behind these trapdoors.
5) Rank privacy. The rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank orders of different search results and further identify the search keyword.
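One common way to realize trapdoor unlinkability, sketched below with illustrative helper names rather than the paper's matrix-based encryption, is to blind the query vector with a fresh random positive factor: every submission yields a different trapdoor, while the relative order of inner-product relevance scores, and hence the ranking, is preserved.

```python
import random

def make_trapdoor(query_vec, rng=random):
    """Blind the query with a fresh random factor r > 0 so identical
    queries produce distinct trapdoors (toy sketch only)."""
    r = rng.uniform(0.5, 2.0)
    return [r * x for x in query_vec]

def score(trapdoor, doc_vec):
    return sum(a * b for a, b in zip(trapdoor, doc_vec))

q = [1.0, 0.0, 1.0]
t1, t2 = make_trapdoor(q), make_trapdoor(q)

docs = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.1]]
rank1 = sorted(range(len(docs)), key=lambda i: score(t1, docs[i]), reverse=True)
rank2 = sorted(range(len(docs)), key=lambda i: score(t2, docs[i]), reverse=True)
# The two trapdoors differ, yet both rankings put document 0 first.
```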
4.4 Notations
In this paper, the notations presented in Table 1 are used.
TABLE 1
Notation

d_i    The ith document vector, denoted as d_i = {d_{i,1}, ..., d_{i,n}}, where d_{i,j} represents whether the jth keyword in the dictionary appears in document d_i.
m      The number of documents in the data collection.
n      The size of the dictionary DW.
CCV    The collection of cluster center vectors, denoted as CCV = {c_1, ..., c_n}, where c_i is the average vector of all document vectors in the cluster.
CCV_i  The collection of the ith-level cluster center vectors, denoted as CCV_i = {v_{i,1}, ..., v_{i,n}}, where v_{i,j} represents the jth vector in the ith level.
DC     The information of document classification, such as the document id list of a certain cluster.
DV     The collection of document vectors, denoted as DV = {d_1, d_2, ..., d_m}.
DW     The dictionary, denoted as DW = {w_1, w_2, ..., w_n}.
F_w    The ranked id list of all documents according to their relevance to keyword w.
I_c    The clustering index, which contains the encrypted vectors of cluster centers.
I_d    The traditional index, which contains encrypted document vectors.
L_i    The minimum relevance score between different documents in the ith level of a cluster.
QV     The query vector.
TH     The fixed maximum number of documents in a cluster.
T_w    The encrypted query vector (trapdoor) for the user's query.
5 ARCHITECTURE AND ALGORITHM
5.1 System Model
In this section, we introduce the MRSE-HCI scheme. The vector space model adopted by MRSE-HCI is the same as in MRSE [19], while the process of building the index is totally different. A hierarchical index structure is introduced into MRSE-HCI instead of a sequential index. In MRSE-HCI, every document is indexed by a vector. Every dimension of the vector stands for a keyword, and the value represents whether the keyword appears in the document. Similarly, the query is also represented by a vector. In the search phase, the cloud server calculates the relevance score between the query and the documents by computing the inner product of the query vector and the document vectors, and returns the target documents to the user according to the top-k relevance scores.
Because all the documents outsourced to the cloud server are encrypted, the semantic relationship between the plain documents is lost in the encrypted domain. In order to maintain this relationship, a clustering method is used to cluster the documents by clustering their related index vectors. Every document vector is viewed as a point in n-dimensional space. With the lengths of the vectors normalized, the distance between points in the n-dimensional space reflects the relevance of the corresponding documents. In other words, the points of highly relevant documents are very close to each other in the n-dimensional space. As a result, we can cluster
the documents based on the distance measure.
Fig. 2 MRSE-HCI architecture
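The claim that distance reflects relevance for normalized vectors admits a one-line check: for unit vectors u and v, ||u − v||² = 2 − 2⟨u, v⟩, so a higher inner-product relevance score means a strictly smaller Euclidean distance. A minimal sketch with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit vectors, ||u - v||^2 = 2 - 2<u, v>: clustering by distance
# therefore groups exactly the documents with high mutual relevance.
u = normalize([3.0, 1.0])
v = normalize([2.0, 1.0])
assert abs(dist(u, v) ** 2 - (2 - 2 * inner(u, v))) < 1e-9
```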
As the volume of data in data centers has experienced dramatic growth, the conventional sequential search approach becomes very inefficient. To improve search efficiency, a hierarchical clustering method is proposed. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold at different levels, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is reached. Upon receiving a legal request, the cloud server searches the related indexes layer by layer instead of scanning all indexes.
5.2 MRSE-HCI Architecture
The MRSE-HCI architecture is depicted in Fig. 2: the data owner builds the encrypted index from the dictionary, random numbers, and a secret key; the data user submits a query to the cloud server to retrieve the desired documents; and the cloud server returns the target documents to the data user. The architecture mainly consists of the following algorithms.
Keygen(1^l(n)) → (sk, k): generates the secret keys used to encrypt the index and the documents.
Index(D, sk) → I: generates the encrypted index using the above-mentioned secret key; the clustering process is also carried out in this phase.
Enc(D, k) → E: encrypts the document collection with a symmetric encryption algorithm that achieves semantic security.
Trapdoor(w, sk) → T_w: generates the encrypted query vector T_w from the user's input keywords and the secret key.
Search(T_w, I, k_top) → (I_w, E_w): the cloud server compares the trapdoor with the index to obtain the top-k retrieval results.
Dec(E_w, k) → F_w: decrypts the returned encrypted documents with the key generated in the first step.
The concrete functions of the different components are described below.
1) Keygen(1^l(n)): The data owner randomly generates an (n+u+1)-bit vector S, in which every element is either 0 or 1, and two invertible (n+u+1)×(n+u+1) matrices whose elements are random integers; together these constitute the secret key sk. The secret key k is generated by the data owner choosing an n-bit pseudo-random sequence.
2) Index(D, sk): As shown in Fig. 3, the data owner uses a tokenizer and parser to analyze every document and extract all keywords. The data owner then uses the dictionary D_w to transform the documents into a collection of document vectors DV. Next, the data owner computes DC and CCV by using a quality hierarchical clustering (QHC) method, which is illustrated in Section 5.4. After that, the data owner applies the dimension-expanding and vector-splitting procedure to every document vector. It is worth noting that CCV is treated the same way as DV. For dimension-expanding, every vector in DV is extended to (n+u+1) dimensions, where the value in dimension n+j (0 ≤ j ≤ u) is a randomly generated integer and the last dimension is set to 1. For vector-splitting, every extended document vector is split into two (n+u+1)-dimensional vectors, V' and V'', with the (n+u+1)-bit vector S serving as a splitting indicator. If the ith element of S (S_i) is 0, then we set V''_i = V'_i = V_i; if S_i is 1, then V''_i is set to a random number and V'_i = V_i − V''_i. Finally, the traditional index I_d is encrypted as I_d = {M_1^T V', M_2^T V''} by matrix multiplication with sk, and I_c is generated in a similar way. After this, I_d, I_c, and DC are outsourced to the cloud server.
3) Enc(D, k): The data owner adopts a secure symmetric encryption algorithm (e.g., AES) to encrypt the plain document set D and outsources it to the cloud server.
4) Trapdoor(w, sk): The data user sends the query to the data owner, who analyzes the query and builds the query vector QV from the query keywords with the help of the dictionary D_w. QV is then extended to an (n+u+1)-dimensional query vector: v random positions chosen from the range (n, n+u] are set to 1, the other expanded positions are set to 0, and the value of the last dimension is set to a random number t ∈ [0, 1]. Then the first (n+u) dimensions of QV, denoted q_w, are scaled by a random number r (r ≠ 0), giving Q_w = (r·q_w, t). After that, Q_w is split into two random vectors {Q'_w, Q''_w} by a vector-splitting procedure similar to that in the Index(D, sk) phase. The difference is that if the ith bit of S is 1, then q'_i = q''_i = q_i; if the ith bit of S is 0, then q'_i is set to a random number and q''_i = q_i − q'_i. Finally, the encrypted query vector T_w is generated as T_w = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} and sent back to the data user.
5) Search(T_w, I, k_top): Upon receiving T_w from the data user, the cloud server computes the relevance score between T_w and the cluster index I_c and chooses the matched cluster with the highest relevance score. For every document contained in the matched cluster, the cloud server extracts its corresponding encrypted document vector in I_d and calculates its relevance score S with T_w, as described in Equation 1. Finally, the scores of the documents in the matched cluster are sorted and the top k_top documents are returned by the cloud server. The details are discussed in Section 5.5.
Fig. 3 Algorithm Index
Fig. 4 Algorithm Dynamic k-means
S = T_w · I_c
  = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} · {M_1^T V', M_2^T V''}
  = Q'_w · V' + Q''_w · V''
  = Q_w · V    (1)
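The correctness of Equation 1 — splitting with S, encrypting with {M_1, M_2}, and recovering Q_w · V as the score — can be checked numerically with a small self-contained sketch. This is an illustrative toy (tiny dimension, exact rational arithmetic, random matrices regenerated until invertible, no dimension expansion or scaling step), not the authors' implementation:

```python
from fractions import Fraction
from random import Random

rng = Random(7)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def invert(M):
    # Gauss-Jordan elimination over exact fractions; raises if M is singular.
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

def random_invertible(d):
    while True:
        M = [[rng.randint(-3, 3) for _ in range(d)] for _ in range(d)]
        try:
            return M, invert(M)
        except StopIteration:   # singular matrix: retry
            pass

def split(vec, S, split_bit):
    # Positions where S[i] == split_bit become two random shares that sum
    # to vec[i]; all other positions are copied into both halves.
    v1, v2 = [], []
    for s, x in zip(S, vec):
        if s == split_bit:
            r = Fraction(rng.randint(-5, 5))
            v1.append(r)
            v2.append(x - r)
        else:
            v1.append(x)
            v2.append(x)
    return v1, v2

d = 4                                        # toy stand-in for n + u + 1
S = [rng.randint(0, 1) for _ in range(d)]    # splitting indicator
M1, M1_inv = random_invertible(d)
M2, M2_inv = random_invertible(d)

V = [Fraction(rng.randint(0, 1)) for _ in range(d)]   # document vector
Q = [Fraction(rng.randint(0, 1)) for _ in range(d)]   # query vector

V1, V2 = split(V, S, split_bit=1)            # index split rule: S_i = 1 -> share
Q1, Q2 = split(Q, S, split_bit=0)            # query split rule: S_i = 0 -> share
enc_index = (mat_vec(transpose(M1), V1), mat_vec(transpose(M2), V2))
trapdoor = (mat_vec(M1_inv, Q1), mat_vec(M2_inv, Q2))

score = dot(trapdoor[0], enc_index[0]) + dot(trapdoor[1], enc_index[1])
assert score == dot(Q, V)                    # Equation 1 holds exactly
```

The complementary split rules are what make the cross terms cancel: wherever one side is shared, the other side carries identical copies, so each coordinate contributes exactly Q_i·V_i.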
6) Dec(E_w, k): The data user utilizes the secret key k to decrypt the returned ciphertext E_w.
5.3 Relevance Measure
In this paper, the concept of coordinate matching [30] is adopted as the relevance measure. It is used to quantify the document-query and document-document relevance, as well as the relevance between the query and the cluster centers. Equation 2 defines the relevance score between document d_i and query q_w. Equation 3 defines the relevance score between query q_w and cluster center lc_{i,j}. Equation 4 defines the relevance score between documents d_i and d_j.
S_{q,d_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × d_{i,t})    (2)

S_{q,c_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × lc_{i,j,t})    (3)

S_{d_i,d_j} = Σ_{t=1}^{n+u+1} (d_{i,t} × d_{j,t})    (4)

Fig. 5 Algorithm Quality Hierarchical Clustering (QHC)
Fig. 6 Clustering Process
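Equations 2-4 are all the same operation, an inner product over the (n+u+1)-dimensional vectors, so a single helper covers all three cases. The vectors below are made-up toy values:

```python
def relevance(a, b):
    # coordinate matching (Equations 2-4): inner product of two vectors
    return sum(x * y for x, y in zip(a, b))

q = [1, 0, 1, 1]    # toy query vector
d1 = [1, 1, 1, 0]   # toy document vectors
d2 = [0, 0, 1, 1]
assert relevance(q, d1) == 2    # query-document score (Equation 2)
assert relevance(d1, d2) == 1   # document-document score (Equation 4)
```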
5.4 Quality Hierarchical Clustering Algorithm
Many hierarchical clustering methods have been proposed so far. However, none of them is comparable to partition clustering methods in terms of time complexity. K-means [31] and K-medoids [32] are popular partition clustering algorithms, but in both of them k is fixed, so they cannot be applied to situations with a dynamic number of cluster centers. We propose a quality hierarchical clustering (QHC) algorithm based on a novel dynamic K-means.
As shown in Fig. 4, the proposed dynamic K-means algorithm defines a minimum relevance threshold to keep each cluster compact and dense. If the relevance score between a document and its center is smaller than the threshold, a new cluster center is added and all the documents are reassigned. This procedure is iterated until k is stable. In contrast with traditional clustering methods, k changes dynamically during the clustering process, which is why the algorithm is called dynamic K-means.
The QHC algorithm, illustrated in Fig. 5, works as follows. Every cluster is checked to see whether its size exceeds the maximum number TH. If it does, this "big" cluster is split into child clusters, which are formed by running dynamic K-means on the documents of that cluster. This procedure is iterated until all clusters meet the maximum-cluster-size requirement. The clustering procedure is illustrated in Fig. 6. All the documents
are denoted as points in a coordinate system. These points are initially partitioned into two clusters by running the dynamic K-means algorithm with k = 2; the two resulting big clusters are depicted by the elliptical shapes. The two clusters are then checked to see whether their points satisfy the distance constraint. The second cluster does not meet this requirement, so a new cluster center is added, k becomes 3, and the dynamic K-means algorithm runs again to partition the second cluster into two parts. The data owner then checks whether the size of each cluster exceeds the maximum number TH. Cluster 1 is split into two sub-clusters because of its size. Finally, all points are clustered into four clusters, as depicted by the rectangles.
Fig. 7 Retrieval Process
Fig. 8 Algorithm Building-minimum hash sub-tree
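The dynamic K-means loop and the QHC splitting loop above can be sketched as follows. This is an illustrative reading of Figs. 4-6, not the authors' code: the relevance threshold, TH, and the toy document vectors are invented, and centers are updated by simple averaging:

```python
def relevance(a, b):
    # inner product; with normalized vectors this reflects closeness
    return sum(x * y for x, y in zip(a, b))

def mean(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def dynamic_kmeans(docs, threshold, k=2, max_rounds=50):
    """Partition docs; add a new center whenever some document's relevance
    to its nearest center falls below `threshold` (so k grows dynamically)."""
    centers = [list(docs[i]) for i in range(min(k, len(docs)))]
    clusters = [list(docs)]
    for _ in range(max_rounds):
        clusters = [[] for _ in centers]
        worst = None
        for d in docs:
            scores = [relevance(d, c) for c in centers]
            best = max(range(len(centers)), key=scores.__getitem__)
            clusters[best].append(d)
            if scores[best] < threshold and (worst is None or scores[best] < worst[0]):
                worst = (scores[best], d)
        if worst is not None and len(centers) < len(docs):
            centers.append(list(worst[1]))   # add a center, reassign next round
            continue
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:           # k and the centers are stable
            break
        centers = new_centers
    return [c for c in clusters if c]

def qhc(docs, threshold, max_size):
    """Recursively split every cluster whose size exceeds max_size (TH)."""
    if len(docs) <= max_size:
        return [docs]
    out = []
    for cluster in dynamic_kmeans(docs, threshold):
        if len(cluster) < len(docs):
            out.extend(qhc(cluster, threshold, max_size))
        else:               # no progress: keep as-is to guarantee termination
            out.append(cluster)
    return out

# toy near-normalized document vectors in three rough groups
docs = [[0.9, 0.1, 0.0], [0.95, 0.05, 0.0], [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.05, 0.0, 0.95]]
clusters = qhc(docs, threshold=0.5, max_size=3)
assert sum(len(c) for c in clusters) == len(docs)   # every document assigned once
```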
5.5 Search Algorithm
The cloud server needs to find the cluster that best matches the query. With the help of the cluster index I_c and the document classification DC, the cloud server uses an iterative procedure to find the best-matched cluster, as the following steps demonstrate:
1) The cloud server computes the relevance score between the query T_w and the encrypted vectors of the first-level cluster centers in the cluster index I_c, then chooses the ith cluster center I_{c,1,i} with the highest score.
2) The cloud server gets the child cluster centers of that cluster center, computes the relevance score between T_w and every encrypted child cluster center vector, and selects the cluster center I_{c,2,i} with the highest score. This procedure is iterated until the ultimate cluster center I_{c,l,j} in the last level l is reached.
In the situation depicted in Fig. 7, there are 9 documents grouped into 3 clusters. After the relevance scores with the trapdoor T_w are calculated, cluster 1, shown within the dashed box in Fig. 7, is found to be the best match. Documents d_1, d_3, and d_9 belong to cluster 1, so their encrypted document vectors in I_d are extracted to compute the relevance scores with T_w.
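The layer-by-layer descent can be sketched as follows. The tiny two-level index loosely mirrors the Fig. 7 example; the vectors, document ids, and dictionary layout are invented for illustration:

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(node, query, k):
    # descend level by level, always following the best-matching center,
    # then rank only the documents in the final matched cluster
    while "children" in node:
        node = max(node["children"], key=lambda c: relevance(query, c["center"]))
    scored = sorted(node["docs"], key=lambda d: relevance(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {"children": [
    {"center": [1, 0],
     "docs": [("d1", [0.9, 0.1]), ("d3", [1, 0]), ("d9", [0.8, 0.2])]},
    {"center": [0, 1],
     "docs": [("d2", [0, 1]), ("d4", [0.1, 0.9])]},
]}
assert search(index, [1, 0], 2) == ["d3", "d1"]   # cluster 1 wins, then top-2
```

Only the documents inside the matched cluster are scored, which is exactly why the per-query cost depends on the cluster size c and tree depth l rather than on the total number of documents m.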
5.6 Search Result Verification
The retrieved data may well be wrong, since the network is unstable and the data may be damaged by hardware/software failure, a malicious administrator, or an intruder. Verifying the authenticity of search results is therefore emerging as a critical issue in the cloud environment. We designed a signed hash tree to verify the correctness and freshness of the search results.
Fig. 9 Algorithm Processing-minimum hash sub-tree
Building. The data owner builds the hash tree based on the hierarchical index structure; the algorithm is shown in Fig. 8. The hash value of a leaf node of the tree is h(id ‖ version ‖ Φ(id)), where id is the document id, version is the document version, and Φ(id) is the document contents. The value of a non-leaf node is a pair (id, h(id ‖ h_child)), where id denotes the value of the cluster center or document vector in the encrypted index, and h_child is the hash value of its child nodes. The hash value of the tree's root node is based on the hash values of all clusters in the first level; it is worth noting that the root node denotes the data set containing all clusters. The data owner then generates a signature over the hash value of the root node and outsources the hash tree, including the root signature, to the cloud server. A cryptographic signature σ (e.g., an RSA or DSA signature) can be used here to authenticate the hash value of the root node.
Processing. Using the algorithm shown in Fig. 9, the cloud server returns the root signature and the minimum hash sub-tree (MHST) to the client. The minimum hash sub-tree includes the hash values of the leaf nodes in the matched cluster and of the non-leaf nodes corresponding to all cluster centers used to find the matched cluster in the searching phase. For example, in Fig. 10 the search result consists of documents D, E, and F. The leaf nodes are then D, E, F, and G, and the non-leaf nodes include C_1, C_2, C_3, C_4, d_D, d_E, d_F, and d_G. In addition, the root is included among the non-leaf nodes.
Verifying. The data user uses the minimum hash sub-tree to re-compute the hash values of the nodes, in particular the root node, which can be further verified against the root signature. If all nodes match, correctness and freshness are guaranteed. The data user then re-searches the index constructed from the retrieved values in the MHST. If this search result is the same as the retrieved result, then completeness, correctness, and freshness are all guaranteed.
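The building and verifying steps can be sketched with a flat two-level tree. SHA-256 stands in for h, the signature over the root is omitted, and the node ids, versions, and layout are a simplified hypothetical stand-in for the paper's MHST rather than its exact structure:

```python
import hashlib

def h(*parts):
    # h(a || b || ...): hash of the concatenated (delimited) parts
    return hashlib.sha256(b"|".join(p.encode() for p in parts)).hexdigest()

# leaf: h(id || version || contents); internal node hash covers id + children
docs = {"D": ("v1", "contents of D"), "E": ("v1", "contents of E"),
        "F": ("v1", "contents of F"), "G": ("v1", "contents of G")}

def leaf_hash(doc_id, version, contents):
    return h(doc_id, version, contents)

def node_hash(node_id, child_hashes):
    return h(node_id, *child_hashes)

# build a tiny two-level tree: cluster C4 holds D..G, the root holds C4;
# in the real scheme the root hash would additionally be signed
hC4 = node_hash("C4", [leaf_hash(d, *docs[d]) for d in "DEFG"])
root = node_hash("root", [hC4])

def verify(returned_docs, expected_root):
    # the user recomputes leaf hashes from the returned documents and
    # folds them back up to the root, comparing with the trusted value
    hC4_check = node_hash("C4", [leaf_hash(d, *returned_docs[d]) for d in "DEFG"])
    return node_hash("root", [hC4_check]) == expected_root

assert verify(docs, root)                                    # untampered
assert not verify({**docs, "D": ("v1", "tampered")}, root)   # detected
```

Any change to a document's contents or version flips its leaf hash, which propagates to the root and breaks the comparison, which is the property the MHST relies on.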
As shown in Fig. 10, in the building phase all documents are clustered into 2 big clusters and 4 small clusters, and each big cluster contains 2 small clusters.
Fig. 10 Authentication for hierarchical clustering index
The hash value of leaf node A is h(id_A ‖ version ‖ Φ(id_A)), the value of non-leaf node C_3 is (id_{C3}, h(id_{C3} ‖ h_A ‖ h_B ‖ h_C)), and the value of non-leaf node C_1 is (id_{C1}, h(id_{C1} ‖ h_{C3} ‖ h_{C4})). The other values of leaf and non-leaf nodes are generated similarly. In order to combine all first-level clusters into a tree, the data owner creates a virtual root node with hash value h(h_{C1,2} ‖ h_{C2,2}), where C_{1,2} and C_{2,2} denote the second part of cluster centers 1 and 2, respectively. The data owner then signs the root node, e.g., σ(h(h_{C1,2} ‖ h_{C2,2})) = (h_{C1,2} ‖ h_{C2,2}, e(h(h_{C1,2} ‖ h_{C2,2}))^k, g), and outsources it to the cloud server.
In the processing phase, suppose that cluster C_4 is the matched cluster and the returned top-3 documents are D, E, and F. The minimum hash sub-tree then includes the hash values of the nodes D, E, F, d_D, d_E, d_F, d_G, C_3, C_2, C_1, C_4 and the signed root σ(h(h_{C1,2} ‖ h_{C2,2})).
In the verifying phase, upon receiving the signed root, the data user first checks whether e(h(h_{C1,2} ‖ h_{C2,2}), g)^k = e(sig_k(h(h_{C1,2} ‖ h_{C2,2})), g). If this does not hold, the retrieved hash tree is not authentic; otherwise the returned nodes D, E, F, d_D, d_E, d_F, d_G, C_3, C_2, C_1, C_4 work together to verify each other and reconstruct the hash tree. If all the nodes are authentic, the returned hash tree is authentic. The data user then re-computes the hash values of the leaf nodes D, E, and F using the returned documents. These newly generated hash values are compared with the corresponding returned hash values. If there is no difference, the retrieved documents are correct. Finally, the data user uses the trapdoor to re-search the index constructed from the first part of the retrieved nodes. If this search result is the same as the retrieved result, the search result is complete.
5.7 Dynamic Data Collection
Since documents stored at the server may be deleted or modified, and new documents may be added to the original data collection, a mechanism that supports dynamic data collection is necessary. A naive way to address these problems is to download all documents and the index locally and then update the data collection and index. However, this method incurs a huge cost in bandwidth and local storage space.
To avoid updating the index frequently, we provide a practical strategy to deal with insertion, deletion, and modification operations. Without loss of generality, we use the following examples to illustrate how the strategy works. The data owner reserves many empty entries in the dictionary for new documents. If a new document contains new keywords, the data owner first adds these new keywords to the dictionary and then constructs a document vector based on the new dictionary. The data owner sends the trapdoor generated from the document vector, the encrypted document, and the encrypted document vector to the cloud server. The cloud server finds the closest cluster and puts the encrypted document and encrypted document vector into it.
As every cluster has a constraint on the maximum size, the number of documents in a cluster may exceed the limit after an insertion. In this case, all the encrypted document vectors belonging to the broken cluster are returned to the data owner. After decrypting the retrieved document vectors, the data owner re-builds the sub-index based on the deciphered document vectors. The sub-index is then re-encrypted and re-outsourced to the cloud server.
Upon receiving a deletion order, the cloud server searches for the target document. The cloud server then deletes the document and the corresponding document vector.
Modifying a document can be described as deleting the old version of the document and inserting the new version. The modification operation can therefore be realized by combining the insertion and deletion operations.
To deal with the impact of these operations on the hash tree, a lazy update strategy is designed. For an insertion, the corresponding hash value is calculated and marked as a raw node, while the original nodes in the hash tree are kept unchanged, because the original hash tree still supports verification of every document except the new one. Only when the newly added document is accessed is the hash tree updated. A similar idea applies to deletion; the only difference is that a deletion does not trigger a hash tree update.
6 EFFICIENCY AND SECURITY
6.1 Search Efficiency
The search process can be divided into a Trapdoor(w, sk) phase and a Search(T_w, I, k_top) phase. The number of operations needed in the Trapdoor(w, sk) phase is given by Equation 5, where n is the number of keywords in the dictionary and w is the number of query keywords.

O(MRSE-HCI) = 5n + uvw + 5    (5)

Since the time complexity of the Trapdoor(w, sk) phase is independent of DC, it can be described as O(1) even when DC increases exponentially.
The difference in the search process between MRSE-HCI and MRSE is the retrieval algorithm used in this phase. In the Search(T_w, I, k_top) phase of MRSE, the cloud server needs to compute the relevance score between the encrypted query vector T_w and all encrypted document vectors in I_d, and obtain the top-k ranked document list F_w. The number of operations needed in the Search(T_w, I, k_top) phase is given by Equation 6, where m represents the number of documents in DC and n represents the number of keywords in the dictionary.

O(MRSE) = 2m(2n + 2u + 1) + m − 1    (6)
However, in the Search(T_w, I, k_top) phase of MRSE-HCI, the cloud server uses the information in DC to quickly locate the matched cluster and compares T_w with only a limited number of encrypted document vectors in I_d. The number of operations needed in the Search(T_w, I, k_top) phase is given by Equation 7, where k_i represents the number of cluster centers to be compared at the ith level, and c represents the number of document vectors in the matched cluster.

O(MRSE-HCI) = (Σ_{i=1}^{l} k_i)·2(2n + 2u + 1) + c·2(2n + 2u + 1) + c − 1    (7)

When DC increases exponentially, m can be set to 2^l. The time complexity of the traditional MRSE is then O(2^l), while the time complexity of the proposed MRSE-HCI is only O(l).
The total search time is given by Equation 8 below, where O(trapdoor) is O(1) and O(query) depends on DC.

O(searchTime) = O(trapdoor) + O(query)    (8)

In short, when the number of documents in DC grows exponentially, the search time of MRSE-HCI increases linearly while that of the traditional methods increases exponentially.
6.2 Security Analysis
To keep the security analysis brief, we adopt some concepts from [38-40] and define what kinds of information are leaked to the honest-but-curious server.
The basic information about documents and queries is inevitably leaked to the honest-but-curious server, since all the data are stored at the server and the queries are submitted to it. Moreover, the access pattern and search pattern cannot be concealed in MRSE-HCI, as in previous searchable encryption schemes [19], [39-41].
Definition 1 (Size Pattern). Let D be a document collection. The size pattern induced by a q-query is a tuple a(D, Q) = (m, |Q_1|, ..., |Q_q|), where m is the number of documents and |Q_i| is the size of query Q_i.
Definition 2 (Access Pattern). Let D be a document collection and I be an index over D. The access pattern induced by a q-query is a tuple b(D, Q) = (I(Q_1), ..., I(Q_q)), where I(Q_i) is the set of identifiers returned by query Q_i, for 1 ≤ i ≤ q.
Definition 3 (Search Pattern). Let D be a document collection. The search pattern induced by a q-query is an m×q binary matrix c(D, Q) such that, for 1 ≤ i ≤ m and 1 ≤ j ≤ q, the element in the ith row and jth column is 1 if the document identifier id_i is returned by query Q_j.
Definition 4 (Known Ciphertext Model Secure). Let Π = (Keygen, Index, Enc, Trapdoor, Search, Dec) be an index-based MRSE-HCI scheme over dictionary D_w, and let n ∈ N be the security parameter. The known ciphertext model security experiment PrivK^{kcm}_{A,Π}(n) is described as follows.
1) The adversary submits two document collections D_0 and D_1 of the same length to a challenger.
2) The challenger generates a secret key {sk, k} by running Keygen(1^l(n)).
3) The challenger randomly chooses a bit b ∈ {0, 1} and returns Index(D_b, sk) → I_b and Enc(D_b, k) → E_b to the adversary.
4) The adversary outputs a bit b'.
5) The output of the experiment is defined to be 1 if b' = b, and 0 otherwise.
We say the MRSE-HCI scheme is secure under the known ciphertext model if, for all probabilistic polynomial-time adversaries A, there exists a negligible function negl(n) such that

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl(n)    (9)
Proof. The adversary A distinguishes the document collections by analyzing the secret key, the index, and the encrypted document collection. We then have Equation 10, where Adv(A_D({sk, k})) is the advantage of adversary A in distinguishing the secret key from two random matrices and two random strings, Adv(A_D(I)) is the advantage in distinguishing the index from a random string, and Adv(A_D(E)) is the advantage in distinguishing the encrypted documents from random strings.

Pr(PrivK^{kcm}_{A,Π}(n) = 1) = 1/2 + Adv(A_D(sk, k)) + Adv(A_D(I)) + Adv(A_D(E))    (10)
The elements of the two matrices in the secret key are randomly chosen from {0, 1}^l(n), and the split indicator S and the key k are also chosen uniformly at random from {0, 1}^l(n). Given {0, 1}^l(n), A distinguishes the secret key from two random matrices and two random strings with negligible probability. Then there exists a negligible function negl_1(n) such that
Adv(A_D(sk, k)) = |Pr(Keygen(1^l(n)) → (sk, k)) − Pr(Random → (sk_r, k_r))| ≤ negl_1(n)    (11)
where sk_r denotes two random matrices and a random string, and k_r is a random string. In our scheme, encrypting the hierarchical index essentially means encrypting all the document vectors and cluster center vectors. All the cluster center vectors are treated as document vectors in the encryption phase. Eventually, all the document vectors and cluster center vectors are encrypted by the secure kNN scheme. As secure kNN is known-plaintext-attack (KPA) secure [33], the hierarchical index is secure under the known ciphertext model. Then there exists a negligible function negl_2(n) satisfying

Adv(A_D(I)) = |Pr(Index(D, sk) → I) − Pr(Random → I_r)| ≤ negl_2(n)    (12)
where I_r is a random string.
Since the encryption algorithm used to encrypt D_b is semantically secure, the encrypted documents are secure under the known ciphertext model. Then there exists a negligible function negl_3(n) such that

Adv(A_D(E)) = |Pr(Enc(D, k) → E) − Pr(Random → E_r)| ≤ negl_3(n)    (13)

where E_r is a random string set.
According to Equations 10, 11, 12, and 13, we obtain Equation 14.

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl_1(n) + negl_2(n) + negl_3(n)    (14)

negl(n) = negl_1(n) + negl_2(n) + negl_3(n)    (15)

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl(n)    (16)

By combining Equations 14 and 15, we conclude Equation 16. We therefore say that MRSE-HCI is secure under the known ciphertext model.
7 EVALUATION METHOD
7.1 Search Precision
Search precision quantifies user satisfaction. Retrieval precision is related to two factors: the relevance between the documents and the query, and the relevance of the documents to each other. Equation 17 defines the relevance between the retrieved documents and the query.

P_q = Σ_{i=1}^{k'} S(q_w, d_i) / Σ_{i=1}^{k} S(q_w, d_i)    (17)
Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, q_w represents the query vector, d_i represents a document vector, and S is a function that computes the relevance score between q_w and d_i. Equation 18 defines the relevance among the retrieved documents.

P_d = Σ_{j=1}^{k'} Σ_{i=1}^{k'} S(d_j, d_i) / Σ_{j=1}^{k} Σ_{i=1}^{k} S(d_j, d_i)    (18)

Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, and d_i and d_j both denote document vectors.
Equation 19 combines the relevance between the query and the retrieved documents with the relevance among the documents to quantify the search precision:

Acc = a·P_q + P_d    (19)

where a acts as a tradeoff parameter balancing the relevance between the query and the documents against the relevance among the documents. If a is bigger than 1, more emphasis is put on the relevance between the query and the documents; otherwise, more emphasis is put on the relevance among the documents.
The above evaluation strategies should be based on the same dataset and keywords.
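Equation 17 can be computed directly once the two result lists are available; P_d from Equation 18 is analogous with a double sum over document pairs. The query and document vectors below are toy values:

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))

def precision_q(query, retrieved, baseline):
    # Equation 17: score mass of the evaluated method's results relative
    # to the plaintext-search baseline for the same query
    return (sum(relevance(query, d) for d in retrieved) /
            sum(relevance(query, d) for d in baseline))

query = [1, 1, 0]
baseline = [[1, 1, 0], [1, 0, 0]]    # docs a plaintext search would return
retrieved = [[1, 1, 0], [0, 1, 0]]   # docs the evaluated scheme returns
assert precision_q(query, retrieved, baseline) == 1.0
```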
7.2 Rank Privacy
Rank privacy quantifies the information leakage of the search results; the definition is adopted from [19]. Equation 20 is used to evaluate the rank privacy.

P_k = Σ_{i=1}^{k} p_i / k    (20)

Here, k denotes the number of top-k retrieved documents, and p_i = |c'_i − c_i|, where c'_i is the rank of document d_i in the retrieved top-k list, c_i is the actual rank of document d_i in the data set, and p_i is set to k if it is greater than k. The overall rank privacy measure at point k, denoted P_k, is defined as the average value of p_i over every document d_i in the retrieved top-k documents.
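Equation 20 reduces to an average of capped rank shifts. A sketch with invented toy ranks:

```python
def rank_privacy(retrieved_ranks, actual_ranks, k):
    # Equation 20: average |c'_i - c_i| over the top-k results,
    # with each per-document shift capped at k
    shifts = [min(abs(r - a), k) for r, a in zip(retrieved_ranks, actual_ranks)]
    return sum(shifts) / k

# toy top-3 list: the doc shown at position 1 actually ranks 4th, etc.
assert rank_privacy([1, 2, 3], [4, 2, 9], k=3) == 2.0
```

A higher P_k means the returned ordering reveals less about the true ranking, which is why the clustering-induced reordering of MRSE-HCI improves this measure.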
8 PERFORMANCE ANALYSIS
In order to test the performance of MRSE-HCI on a real dataset, we built an experimental platform to test the search efficiency, accuracy, and rank privacy. We implemented the experiment on a distributed platform consisting of three ThinkServer RD830 machines and a ThinkCenter M8400t. The dataset is built from IEEE Xplore and includes about 51,000 documents and 22,000 keywords.
According to the notation defined in Section IV, n denotes the dictionary size, k denotes the number of top-k documents, m denotes the number of documents in the data set, and w denotes the number of keywords in the user's query.
Fig. 11 Search efficiency: (a) search time with an increasing number of documents; (b) search time with an increasing number of retrieved documents; (c) search time with an increasing number of query keywords
Fig. 11 describes the search efficiency under different conditions. Fig. 11 (a) shows the search efficiency for different document set sizes with the dictionary size, number of retrieved documents, and number of query keywords unchanged (n = 22157, k = 20, w = 5). In Fig. 11 (b), we adjust the value of k with the dictionary size, document set size, and number of query keywords unchanged (n = 22157, m = 51312, w = 5). Fig. 11 (c) tests different numbers of query keywords with the dictionary size, document set size, and number of retrieved documents unchanged (n = 22157, m = 51312, k = 20).
From Fig. 11 (a), we can observe that with exponential growth of the document set size, the search time of MRSE increases exponentially, while the search time of MRSE-HCI increases linearly. As Fig. 11 (b) and (c) show, the search time of MRSE-HCI stays stable as the number of query keywords and retrieved documents increases, and it remains far below that of MRSE.
Fig. 12 describes the search accuracy, using plaintext search as the baseline. Fig. 12 (a) illustrates the relevance of the retrieved documents. As the number of documents increases from 3200 to 51200, the MRSE-to-plaintext-search ratio fluctuates around 1, while the MRSE-HCI-to-plaintext-search ratio increases from 1.5 to 2. From Fig. 12 (a), we can observe that the relevance of the retrieved documents in MRSE-HCI is almost twice that in MRSE, which means the documents retrieved by MRSE-HCI are much closer to each other.
Fig. 12 (b) shows the relevance between the query and the retrieved documents. As the document set size increases from 3200 to 51200, the MRSE-to-plaintext-search ratio fluctuates around 0.75, while the MRSE-HCI-to-plaintext-search ratio increases from 0.65 to 0.75 with the growth of the document set size.
Fig. 12 Search precision: (a) relevance of documents; (b) relevance between documents and query; (c) overall evaluation
Fig. 13 Rank privacy
From Fig. 12 (b), we can see that the relevance between the query and the retrieved documents in MRSE-HCI is slightly lower than in MRSE. This gap narrows as the data size increases, since a big document set has a clear category distribution, which improves the relevance between the query and the documents. Fig. 12 (c) shows the rank accuracy according to Equation 19. The tradeoff parameter a is set to 1, meaning there is no bias toward either the relevance of documents or the relevance between documents and query. From the results, we conclude that MRSE-HCI is better than MRSE in rank accuracy.
Fig. 13 describes the rank privacy according to Equation 20. In this test, regardless of the number of retrieved documents, MRSE-HCI has better rank privacy than MRSE. This is mainly caused by the relevance of documents being introduced into the search strategy.
9 CONCLUSION
In this paper, we investigated ciphertext search in the cloud storage scenario. We explored the problem of maintaining the semantic relationship between plain documents over the related encrypted documents, and gave a design method to enhance the performance of semantic search. We also proposed the MRSE-HCI architecture to meet the requirements of data explosion, online information retrieval, and semantic search. At the same time, a verifiable mechanism was also proposed to guarantee
the correctness and completeness of the search results. In addition, we analyzed the search efficiency and security under two popular threat models. An experimental platform was built to evaluate search efficiency, accuracy, and rank security. The experimental results show that the proposed architecture not only properly solves the multi-keyword ranked search problem, but also brings an improvement in search efficiency, rank security, and the relevance between retrieved documents.
10 ACKNOWLEDGEMENT
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06040602) and the Xinjiang Uygur Autonomous Region science and technology plan (No. 201230121).
REFERENCES
[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, ”Security
analysis of authentication protocols for next-generation mobile
and CE cloud services,” in Proc. ICCE, Berlin, Germany, 2011,
pp. 83-87.
[2] D. X. D. Song, D. Wagner, and A. Perrig, ”Practical techniques
for searches on encrypted data,” in Proc. S &P, BERKELEY,
CA, 2000, pp. 44-55.
[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano,
”Public key encryption with keyword search,” in Proc. EURO-
CRYPT, Interlaken, SWITZERLAND, 2004, pp. 506-522.
[4] Y. C. Chang, and M. Mitzenmacher, ”Privacy preserving key-
word searches on remote encrypted data,” in Proc. ACNS,
Columbia Univ, New York, NY, 2005, pp. 442-455.
[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, ”Search-
able symmetric encryption: improved definitions and efficient
constructions,” in Proc. ACM CCS, Alexandria, Virginia, USA,
2006, pp. 79-88.
[6] M. Bellare, A. Boldyreva, and A. O’Neill, ”Deterministic and
efficiently searchable encryption,” in Proc. CRYPTO, Santa Bar-
bara, CA, 2007, pp. 535-552.
[7] D. Boneh, and B. Waters, ”Conjunctive, subset, and range
queries on encrypted data,” in Proc. TCC, Amsterdam,
NETHERLANDS, 2007, pp. 535-554.
[8] D. X. D. Song, D. Wagner, and A. Perrig, ”Practical techniques
for searches on encrypted data,” in Proc. S &P 2000, BERKE-
LEY, CA, 2000, pp. 44-55.
[9] E.-J. Goh, Secure Indexes, IACR Cryptology ePrint Archive, vol.
2003, pp. 216. 2003.
[10] C. Wang, N. Cao, K. Ren, and W. J. Lou, Enabling Secure and
Efficient Ranked Keyword Search over Outsourced Cloud Data,
IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467-1479,
Aug. 2012.
[11] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He, M.
Wu, and D. Oard, ”Confidentiality-Preserving Rank-Ordered
Search,” in Proc. ACM StorageSS, Alexandria, VA, 2007, pp.
7-12.
[12] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, ”Zerber+R:
top-k retrieval from a confidential index,” in Proc. EDBT, Saint
Petersburg, Russia, 2009, pp. 439-449.
[13] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, ”Secure Ranked
Keyword Search over Encrypted Cloud Data,” in Proc. ICDCS,
Genova, ITALY, 2010.
[14] P. Golle, J. Staddon, and B. Waters, ”Secure conjunctive key-
word search over encrypted data,” in Proc. ACNS, Yellow Mt,
China, 2004, pp. 31-45.
[15] L. Ballard, S. Kamara, and F. Monrose, ”Achieving efficient
conjunctive keyword searches over encrypted data,” in Proc.
ICICS, Beijing, China, 2005, pp. 414-426.
[16] R. Brinkman, Searching in Encrypted Data, Ph.D. dissertation, University of Twente, 2007.
[17] Y. H. Hwang, and P. J. Lee, ”Public key encryption with
conjunctive keyword search and its extension to a multi-user
system,” in Proc. Pairing, Tokyo, JAPAN, 2007, pp. 2-22.
[18] H. Pang, J. Shen, and R. Krishnan, Privacy-Preserving
Similarity-Based Text Retrieval, ACM Trans. Internet. Technol.,
vol. 10, no. 1, pp. 39, Feb. 2010.
[19] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, ”Privacy-
Preserving Multi-keyword Ranked Search over Encrypted
Cloud Data,” in Proc. IEEE INFOCOM, Shanghai, China, 2011,
pp. 829-837.
[20] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and
H. Li, ”Privacy-preserving multi-keyword text search in the
cloud supporting similarity-based ranking,” in Proc. ASIACCS,
Hangzhou, China, 2013, pp. 71-82.
[21] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, ”Dynamic
authenticated index structures for outsourced databases,” in
Proc. ACM SIGMOD, Chicago, IL, USA, 2006, pp. 121-132.
[22] H. H. Pang, and K. L. Tan, ”Authenticating query results in
edge computing,” in Proc. ICDE, Boston, MA, 2004, pp. 560-571.
[23] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong,
and S. G. Stubblebine, A general model for authenticated data
structures, Algorithmica, vol. 39, no. 1, pp. 21-41, May. 2004.
[24] R. C. Merkle, ”Protocols for public key cryptosystems,” in Proc. S&P, Oakland, CA, 1980, pp. 122-134.
[25] R. C. Merkle, ”A certified digital signature,” Lect. Notes Comput. Sci., vol. 435, pp. 218-238, 1990.
[26] M. Naor, and K. Nissim, Certificate revocation and certificate
update, IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561-570,
Apr. 2000.
[27] H. Pang, and K. Mouratidis, Authenticating the query results
of text search engines, Proc. VLDB Endow., vol. 1, no. 1, pp.
126-137, Aug. 2008.
[28] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, ”A Hierarchical
Clustering Method For Big Data Oriented Ciphertext Search,”
presented at Proc. BigSecurity, Toronto, Canada, Apr. 27-May.
2, 2014.
[29] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, ”Achieving Secure,
Scalable, and Fine-grained Data Access Control in Cloud Com-
puting,” in Proc. IEEE INFOCOM, San Diego, CA, 2010, pp.
1-9.
[30] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes:
compressing and indexing documents and images, 2nd ed., San
Francisco: Morgan Kaufmann, 1999.
[31] J. MacQueen, ”Some methods for classification and analysis of
multivariate observations,” in Proc. Berkeley Symp. Math. Stat.
Prob, California, USA, 1967, p. 14.
[32] Z. X. Huang, Extensions to the k-means algorithm for cluster-
ing large data sets with categorical values, Data Min. Knowl.
Discov., vol. 2, no. 3, pp. 283-304, Sep. 1998.
[33] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, ”Secure
kNN Computation on Encrypted Databases,” in Proc. ACM
SIGMOD, Providence, RI, 2009, pp. 139-152.
[34] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu,
Efficient multi-keyword ranked query over encrypted data in
cloud computing, Futur. Gener. Comp. Syst., vol. 30, pp. 179-
190, Jan. 2014.
[35] C. Gentry, ”Fully homomorphic encryption using ideal lattices,” in Proc. STOC, Bethesda, MD, 2009, pp. 169-178.
[36] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, ”Public key encryption with keyword search,” in Proc. EUROCRYPT, Interlaken, Switzerland, 2004, pp. 506-522.
[37] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Highly-scalable searchable symmetric encryption with support for Boolean queries,” in Proc. CRYPTO, Santa Barbara, CA, 2013, pp. 353-373.
[38] S. Kamara, C. Papamanthou, and T. Roeder, ”Dynamic searchable symmetric encryption,” in Proc. ACM CCS, Raleigh, NC, 2012, pp. 965-976.
[39] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, ”Searchable symmetric encryption: improved definitions and efficient constructions,” in Proc. ACM CCS, Alexandria, VA, 2006, pp. 79-88.
[40] M. Chase and S. Kamara, ”Structured encryption and controlled disclosure,” in Proc. ASIACRYPT, Singapore, 2010, pp. 577-594.
[41] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Dynamic searchable encryption in very large databases: data structures and implementation,” in Proc. NDSS, San Diego, CA, 2014.
[42] S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Outsourced symmetric private information retrieval,” in Proc. ACM CCS, Berlin, Germany, 2013, pp. 875-888.
Chi Chen (M'14) received the B.S. (2000) and M.S. (2003) degrees from Shandong University, Jinan, China, and the Ph.D. degree (2008) from the Institute of Software, Chinese Academy of Sciences, Beijing, China. He is an associate research fellow at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include cloud security and database security. From 2003 to 2011, he was a research apprentice, research assistant, and associate research fellow with the State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences. Since 2012, he has been an associate research fellow with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.
Xiaojie Zhu received the B.S. degree from Zhejiang University of Technology, Hangzhou, China, in 2011. He is currently pursuing the M.S. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.
Peisong Shen received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2012. He is currently pursuing the Ph.D. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.
Jiankun Hu is a Professor and Research Director of the Cyber Security Lab, The University of New South Wales, Canberra, Australia. He has obtained seven ARC (Australian Research Council) grants and is now serving on the prestigious Panel of Mathematics, Information and Computing Sciences of the ARC ERA Evaluation Committee.
Song Guo received his Ph.D. in computer science from the University of Ottawa, Canada. From 2001 to 2006, he worked as chief software architect for Liska Biometry Inc., NH, USA. Dr. Guo also held a position with the Department of Electrical and Computer Engineering, the University of British Columbia, on a prestigious NSERC (Natural Sciences and Engineering Research Council of Canada) Postdoctoral Fellowship in 2006. From 2006 to 2007, he was an Assistant Professor at the Department of Computer Science, University of Northern British Columbia, Canada. He is currently a Full Professor with the School of Computer Science and Engineering, the University of Aizu, Japan.
Zahir Tari received the degree in mathematics from the University of Science and Technology Houari Boumediene, Bab-Ezzouar, Algeria, in 1984, the Masters degree in operational research from the University of Grenoble, Grenoble, France, in 1985, and the PhD degree in computer science from the University of Grenoble, in 1989. He is a Professor (in distributed systems) at RMIT University, Melbourne, Australia. Earlier, he joined the Database Laboratory at EPFL (Swiss Federal Institute of Technology, 1990-1992) and then moved to QUT (Queensland University of Technology, 1993-1995) and RMIT (Royal Melbourne Institute of Technology, since 1996). He is the Head of DSN (Distributed Systems and Networking) at the School of Computer Science and IT, where he pursues high-impact research and development in computer science. He leads a few research groups that focus on some of the core areas, including networking (QoS routing, TCP/IP congestion), distributed systems (performance, security, mobility, reliability), and distributed applications (SCADA, Web/Internet applications, mobile applications). His recent research interests are in performance (in Cloud) and security (in SCADA systems). Dr. Tari regularly publishes in prestigious journals (such as IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Web Services, and ACM Transactions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.). He co-authored two books (John Wiley) and edited more than 10 books. He has been the Program Committee Chair of several international conferences, including DOA (Distributed Object and Application Symposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on Data Semantics. He has also been the General Chair of more than 12 conferences. He is the recipient of 14 ARC (Australian Research Council) grants. He is a senior member of the IEEE.
Albert Y. Zomaya is currently the Chair Professor of High Performance Computing and Networking and an Australian Research Council Professorial Fellow in the School of Information Technologies, The University of Sydney, Sydney, Australia. He is also the Director of the Centre for Distributed and High Performance Computing, which was established in late 2009. He is the author/co-author of seven books and more than 370 papers, and the editor of nine books and 11 conference proceedings. Prof. Zomaya is the Editor-in-Chief of the IEEE Transactions on Computers and serves as an Associate Editor for 19 leading journals. He is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He is a Chartered Engineer (CEng), a Fellow of the AAAS, the IEEE, and the IET (UK), and a Distinguished Engineer of the ACM.
... The completeness of the search means that the retrieved data has not been tampered with. In addition, Chen et al. [9] proposed an authenticated Merkle hash tree to verify the search result. Although significant progress has been made by the existing constructions [8] [9], the verifiable property comes at the high cost of extra storage and computation. ...
... Finally, with the availability of GPUs and TPUs, the requirement of parallelism is essential. Although efficient SSE constructions are available [6], [9], existing solutions are still highly sequential. ...
... Thereafter, Curtmola et al. [7] formalized the security definition of SSE and proposed two constructions that corresponded to nonadaptive semantic security and adaptive semantic security by assuming the the existence of a pseudo-random permutation and an encryption algorithm that provides security against chosen plaintext attacks. Following these definitions, various SSE schemes have been proposed to enrich queries and enhance search efficiency, such as ranked keyword search [18] [9], fuzzy keyword search [19], similarity search [20], semantic search [21], and parallel search [22]. ...
Article
Full-text available
Cloud service models intrinsically cater to multiple tenants. In current multi-tenancy model, cloud service providers isolate data within a single tenant boundary with no or minimum cross-tenant interaction. With the booming of cloud applications, allowing a user to search across tenants is crucial to utilize stored data more effectively. However, conducting such a search operation is inherently risky, primarily due to privacy concerns. Moreover, existing schemes typically focus on a single tenant and are not well suited to extend support to a multi-tenancy cloud, where each tenant operates independently. In this article, to address the above issue, we provide a privacy-preserving, verifiable, accountable, and parallelizable solution for “privacy-preserving keyword search problem" among multiple independent data owners. We consider a scenario in which each tenant is a data owner and a user’s goal is to efficiently search for granted documents that contain the target keyword among all the data owners. We first propose a verifiable yet accountable keyword searchable encryption (VAKSE) scheme through symmetric bilinear mapping. For verifiability, a message authentication code (MAC) is computed for each associated piece of data. To maintain a consistent size of MAC, the computed MACs undergo an exclusive OR operation. For accountability, we propose a keyword-based accountable token mechanism where the client’s identity is seamlessly embedded without compromising privacy. Furthermore, we introduce the parallel VAKSE scheme, in which the inverted index is partitioned into small segments and all of them can be processed synchronously. We also conduct formal security analysis and comprehensive experiments to demonstrate the data privacy preservation and efficiency of the proposed schemes, respectively.
... Description of data is done only where attributes match file attributes. [15,19] Cipher text policy attributes-based encryption provides one to several flexible access controls. In this strategy each document is encoded independently and their encryption effectiveness expanded by hierarchical property based encryption scheme. ...
... The collection of documents can be encrypted and generate an included access tree and rather than each and every encryption. In [14,15] both cipher text storage and time cost of encryption or decryption are stored. The proposed approach is demonstrated hypothetically and its proficiency for effective looking of the encrypted documents and make a record structure for the collection of documents. ...
... Guo etal [15] proposed a leakage hierarchical attribute-based encryption technique to protect next to the input data outflow attacks .Cao etal [17] propose a multi-key ranked search scheme using secure K-Nearest Neighbour algorithm. The collection of privacy requirements are recognized and there are two schemes are proposed to increase searching efficiency and security .Li etal [5] proposed a new ABE scheme which can execute keyword search function. ...
... Chi Chen and Xiaojie Zhu [7] used a hierarchical clustering method to maintain the close relationship between plain documents and encrypted documents to increase search efficiency within a big data environment. They also used a coordinate matching technique [8] to measure the relevance score between query and document. ...
... From chart 1, we can see that the time needed to search the documents increases when the size of dataset increases. Compared with the previous related work[7] time needed to search the documents is less. ...
Article
Full-text available
Cloud computing provides the facility to store and manage data remotely. The volume of information is increasing per day. The owners choose to store the sensitive data on the cloud storage. To protect the data from unauthorized accesses, the data must be uploaded in encrypted form. Due to large amount of information is stored on the cloud storage; the association between the documents is hiding when the documents are encrypted. It is necessary to design a search technique which gives the results on the basis of the similarity values of the encrypted documents. In this paper a cosine similarity clustering method is proposed to make the clusters of similar documents based on the cosine values of the document vectors. We also proposed a MRSE-CSI model used to search the documents which are in encrypted form. The proposed search technique only finds the cluster of documents with the highest similarity value instead of searching on the whole dataset. Processing the dataset on two algorithms shows that the time needed to form the clusters in the proposed method is less. When the documents in the dataset increases, the time needed to form clusters also increases. The result of the search shows that increasing the documents also increases the search time of the proposed method.
... [47]. Similar to the models in [48], FCC calls for better data privacy, despite good trustworthiness. ...
Article
Full-text available
To solve the security problems of the moving robot system in the fog network of the Industrial Internet of Things (IIoT), this paper presents a privacy‐preserving data integration scheme in the moving robot system. First, a novel data collection enhancement algorithm is proposed to enhance the image effects, and a k‐anonymous location and data privacy protection protocol based on Ad hoc network (Ad hoc‐based KLDPP protocol) is designed in secure data collection phase to protect the privacy of location and network data. Second, the secure multiparty computation with verifiable key sharing is introduced to realize the valid computation against share cheating in the robot system. Third, the ciphertext classification method in a neural network is considered in the secure data storage process to realize the special application. Finally, experiments and simulations are conducted on the robot system of fog computing in the IIoT. The results demonstrate that the proposed scheme can improve the security and efficiency of the said robot system.
... ES 3 DBMS attains enhanced security by utilizing the learning with errors (LWE)-based secure kNN algorithm to encrypt features indices [31]. This approach guarantees strong privacy protection for the underlying feature vectors. ...
Article
Full-text available
Deep learning-based semantic search (DLSS) aims to bridge the gap between experts and non-experts in search. Experts can create precise queries due to their prior knowledge, while non-experts struggle with specific terms and concepts, making their queries less precise. Cloud infrastructure offers a practical and scalable platform for data owners to upload their data, making it accessible to intended data users. However, the contemporary single-owner/single-user (S/S) approach to DLSS schemes falls short of effectively leveraging the inherent multi-user capabilities of cloud infrastructure. Furthermore, most of these schemes delegate the dissemination of secret keys to a single trust point within the mutual distrust scenario in cloud infrastructure. This paper proposes a Secure Semantic Search using Deep Learning in a Blockchain-Assisted Multi-User Setting (S3DBMS)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S^3DBMS)$$\end{document}. Specifically, the seamless integration of attribute-based encryption with transfer learning allows the construction of DLSS in multi-owner/multi-user (M/M) settings. Further, blockchain’s smart contract mechanism allows a multi-attribute authority consensus-based generation of user private keys and system-wide global parameters in a mutual distrust M/M scenario. Finally, our scheme achieves privacy requirements and offers improved security and accuracy.
Article
This research aims to enhance photo encryption security by developing a sophisticated technique. This method uses homomorphic encryption to address challenges in encrypting visible spectrum pictures. Each red–green–blue (RGB) channel of the image is divided into smaller sub-values, encrypted separately using an optimized homomorphic encryption algorithm, and then combined for further encryption. Additionally, a novel approach involves combining surrounding pixels to embed extra data during encryption. The process allows for compression and decompression of encrypted components for easier storage or transmission. After decryption, the initial pixel values are recovered, removing any unnecessary data and condensing each channel's pixel intensity into just two sub-values. Multiple security evaluations confirm the method's robustness and resistance, emphasizing its strong security features for encrypted images.
Article
Recently, the Convolutional Neural Network (CNN) based Content-Based Image Retrieval (CBIR) has substantially improved the search accuracy of encrypted images. Further, the increasing trends in outsourcing the CNN-based CBIR service to the cloud relieve the users from severe computation and storage requirements. However, all of the existing CNN-based CBIR schemes lack the support for Multi-owner multi-user settings and thus significantly limit the flexibility and scalability of cloud computing. To fill this gap, we propose a V erifiable P rivacy-preserving I mage R etrieval scheme in the M ulti-owner multi-user setting (VPIRM). VPIRM utilizes a two-phase transfer learning technique. In the first phase, convolution base transfer takes the pre-trained CNN model for feature extraction, which addresses the issue of scarce training data at the image owner (IO) side. In the second phase, novel secure transfer enables the image user (IU) to construct a query feature vector over the same feature space on which the model is trained. Meanwhile, our scheme simultaneously supports fine-grained access control, dynamic updates, and results correctness and completeness on a malicious cloud server. Finally, a thorough security analysis shows that the scheme achieves various privacy requirements under the known-ciphertext and known-background threat model.
Article
Outsourcing data to the cloud has become prevalent, so Searchable Symmetric Encryption (SSE), one of the methods for protecting outsourced data, has arisen widespread interest. Moreover, many novel technologies and theories have emerged, especially for the attacks on SSE and privacy-preserving. But most surveys related to SSE concentrate on one aspect (e.g., single keyword search, fuzzy keyword search, etc.) or lack in-depth analysis. Therefore, we revisit the existing work and conduct a comprehensive analysis and summary. We provide an overview of state of the art in SSE and focus on the privacy it can protect. Generally, (1) we study the work of the past few decades and classify SSE based on query expressiveness. Meanwhile, we summarize the existing schemes and analyze their performance on efficiency, storage space, index structures, etc.; (2) we complement the gap in the privacy of SSE and introduce in detail the attacks and the related defenses; (3) we discuss the open issues and challenges in existing schemes and future research directions. We desire that our work will help novices to grasp and understand SSE comprehensively. We expect it can inspire the SSE community to discover more crucial leakages and design more efficient and secure constructions.
Conference Paper
Full-text available
With the increasing popularity of cloud computing, huge amount of documents are outsourced to the cloud for reduced management cost and ease of access. Although en-cryption helps protecting user data confidentiality, it leaves the well-functioning yet practically-efficient secure search functions over encrypted data a challenging problem. In this paper, we present a privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, we propose to build the search index based on term frequency and the vector space model with cosine similarity measure to achieve higher search result accuracy. To improve the search efficiency, we propose a tree-based index structure and various adaption methods for multi-dimensional (MD) algorithm so that the practical search efficiency is much better than that of linear search. To further enhance the search privacy, we propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., known ciphertext model and known background model. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
Conference Paper
Full-text available
Following the wide use of cloud services, the volume of data stored in the data center has experienced a dramatically growth which makes real-time information retrieval much more difficult than before. Furthermore, text information is usually encrypted before being outsourced to data centers in order to protect users' data privacy. Current techniques to search on encrypted data do not perform well within such a massive data environment. In this paper, a hierarchical clustering method for ciphertext search within a big data environment is proposed. The proposed approach clusters the documents based on the minimum similarity threshold, and then partitions the resultant clusters into sub-clusters until the constraint on the maximum size of cluster is reached. In the search phase, this approach can reach a linear computational complexity against exponential size of document collection. In addition, retrieved documents have a better relationship with each other than traditional methods. An experiment has been conducted using the collection set built from the recent ten years' IEEE INFOCOM publications, including about 3000 documents with nearly 5300 keywords. The results have validated our proposed approach.
Article
Full-text available
This work presents the design and analysis of the first searchable symmetric encryption (SSE) protocol that supports conjunctive search and general Boolean queries on outsourced symmetrically- encrypted data and that scales to very large databases and arbitrarily-structured data including free text search. To date, work in this area has focused mainly on single-keyword search. For the case of conjunctive search, prior SSE constructions required work linear in the total number of documents in the database and provided good privacy only for structured attribute-value data, rendering these solutions too slow and inflexible for large practical databases. In contrast, our solution provides a realistic and practical trade-off between performance and privacy by efficiently supporting very large databases at the cost of moderate and well-defined leakage to the outsourced server (leakage is in the form of data access patterns, never as direct exposure of plaintext data or searched values). We present a detailed formal cryptographic analysis of the privacy and security of our protocols and establish precise upper bounds on the allowed leakage. To demonstrate the real-world practicality of our approach, we provide performance results of a prototype applied to several large representative data sets, including encrypted search over the whole English Wikipedia (and beyond).
Conference Paper
Full-text available
In the setting of searchable symmetric encryption (SSE), a data owner D outsources a database (or document/file collection) to a remote server E in encrypted form such that D can later search the collection at E while hiding information about the database and queries from E. Leakage to E is to be confined to well-defined forms of data-access and query patterns while preventing disclosure of explicit data and query plaintext values. Recently, Cash et al. presented a protocol, OXT, which can run arbitrary boolean queries in the SSE setting and which is remarkably efficient even for very large databases. In this paper we investigate a richer setting in which the data owner D outsources its data to a server E but D is now interested to allow clients (third parties) to search the database such that clients learn the information D authorizes them to learn but nothing else while E still does not learn about the data or queried values as in the basic SSE setting. Furthermore, motivated by a wide range of applications, we extend this model and requirements to a setting where, similarly to private information retrieval, the client's queried values need to be hidden also from the data owner D even though the latter still needs to authorize the query. Finally, we consider the scenario in which authorization can be enforced by the data owner D without D learning the policy, a setting that arises in court-issued search warrants. We extend the OXT protocol of Cash et al. to support arbitrary boolean queries in all of the above models while withstanding adversarial non-colluding servers (D and E) and arbitrarily malicious clients, and while preserving the remarkable performance of the protocol.
Article
With the growing popularity of cloud computing, huge amounts of documents are outsourced to the cloud for reduced management cost and ease of access. Although encryption helps protect user data confidentiality, it leaves the well-functioning yet practically-efficient secure search functions over encrypted data a challenging problem. In this paper, we present a verifiable privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, we propose to build the search index based on term frequency and the vector space model with cosine similarity measure to achieve higher search result accuracy. To improve the search efficiency, we propose a tree-based index structure and various adaptive methods for multi-dimensional (MD) algorithm so that the practical search efficiency is much better than that of linear search. To further enhance the search privacy, we propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., known ciphertext model and known background model. In addition, we devise a scheme upon the proposed index tree structure to enable authenticity check over the returned search results. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
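The term-frequency vector space model with cosine similarity that the abstract above builds its index on can be illustrated in plaintext. This is only an illustrative sketch of the ranking measure itself, not of the paper's encrypted tree index; the document collection and query here are invented examples.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Raw term-frequency vector for a tokenized document."""
    return Counter(tokens)

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical plaintext collection; an MTS-style scheme would store
# encrypted index vectors instead of raw tokens.
docs = {
    "d1": "cloud storage secure search cloud".split(),
    "d2": "keyword search ranking".split(),
}
query_vec = tf_vector("cloud search".split())
ranked = sorted(docs,
                key=lambda d: cosine_similarity(tf_vector(docs[d]), query_vec),
                reverse=True)
```

Documents sharing more query terms, weighted by how often they occur, score closer to the query vector and rank first.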
Article
We study the setting in which a user stores encrypted documents (e.g. e-mails) on an untrusted server. In order to retrieve documents satisfying a certain search criterion, the user gives the server a capability that allows the server to identify exactly those documents. Work in this area has largely focused on search criteria consisting of a single keyword. If the user is actually interested in documents containing each of several keywords (conjunctive keyword search) the user must either give the server capabilities for each of the keywords individually and rely on an intersection calculation (by either the server or the user) to determine the correct set of documents, or alternatively, the user may store additional information on the server to facilitate such searches. Neither solution is desirable; the former enables the server to learn which documents match each individual keyword of the conjunctive search and the latter results in exponential storage if the user allows for searches on every set of keywords. We define a security model for conjunctive keyword search over encrypted data and present the first schemes for conducting such searches securely. We propose first a scheme for which the communication cost is linear in the number of documents, but that cost can be incurred "offline" before the conjunctive query is asked. The security of this scheme relies on the Decisional Diffie-Hellman (DDH) assumption. We propose a second scheme whose communication cost is on the order of the number of keyword fields and whose security relies on a new hardness assumption.
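The naive intersection approach that the abstract above argues against can be made concrete with a small plaintext sketch. The index and document ids here are hypothetical; the point is that whoever computes the intersection necessarily sees the full match set for each individual keyword, which is exactly the leakage the paper's secure schemes are designed to avoid.

```python
# Naive conjunctive search: look up each keyword's match set separately,
# then intersect. The per-keyword match sets are exposed to the party
# doing the intersection (the leakage criticized in the abstract).

index = {
    "cloud":  {"d1", "d2", "d4"},
    "search": {"d1", "d3", "d4"},
    "secure": {"d1", "d4", "d5"},
}

def conjunctive_match(index, keywords):
    """Return ids of documents containing every queried keyword."""
    sets = [index.get(k, set()) for k in keywords]
    if not sets:
        return set()
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return result
```

A secure conjunctive scheme must return the same final set while hiding the intermediate per-keyword sets.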
Article
Cloud computing infrastructure is a promising new technology and greatly accelerates the development of large scale data storage, processing and distribution. However, security and privacy become major concerns when data owners outsource their private data onto public cloud servers that are not within their trusted management domains. To avoid information leakage, sensitive data have to be encrypted before uploading onto the cloud servers, which makes it a big challenge to support efficient keyword-based queries and rank the matching results on the encrypted data. Most current works only consider single keyword queries without appropriate ranking schemes. In the current multi-keyword ranked search approach, the keyword dictionary is static and cannot be extended easily when the number of keywords increases. Furthermore, it does not take the user behavior and keyword access frequency into account. For the query matching result which contains a large number of documents, the out-of-order ranking problem may occur. This makes it hard for the data consumer to find the subset that is most likely satisfying its requirements. In this paper, we propose a flexible multi-keyword query scheme, called MKQE to address the aforementioned drawbacks. MKQE greatly reduces the maintenance overhead during the keyword dictionary expansion. It takes keyword weights and user access history into consideration when generating the query result. Therefore, the documents that have higher access frequencies and that match closer to the users’ access history get higher rankings in the matching result set. Our experiments show that MKQE presents superior performance over the current solutions.
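The idea of folding keyword access frequency and user history into the ranking, as described in the MKQE abstract above, can be sketched as a weighted score. The blending formula, the `alpha` parameter, and the data are hypothetical illustrations; the paper defines its own scoring over an encrypted index.

```python
def rank_with_history(matches, relevance, access_count, alpha=0.7):
    """Rank matching documents by blending a base relevance score with
    normalized access frequency, so frequently accessed documents rise
    in the result list. `alpha` (hypothetical) trades relevance against
    history; alpha=1.0 ignores history entirely."""
    max_access = max((access_count.get(d, 0) for d in matches), default=0) or 1

    def score(doc_id):
        freq = access_count.get(doc_id, 0) / max_access
        return alpha * relevance[doc_id] + (1 - alpha) * freq

    return sorted(matches, key=score, reverse=True)

# Two equally relevant documents: the frequently accessed one ranks first.
matches = ["d1", "d2"]
relevance = {"d1": 0.5, "d2": 0.5}
access_count = {"d2": 10}
ranked = rank_with_history(matches, relevance, access_count)
```

This also suggests how the out-of-order ranking problem is mitigated: ties and near-ties in pure relevance are broken by observed user behavior.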
Article
Searchable symmetric encryption (SSE) allows a client to encrypt its data in such a way that this data can still be searched. The most immediate application of SSE is to cloud storage, where it enables a client to securely outsource its data to an untrusted cloud provider without sacrificing the ability to search over it. SSE has been the focus of active research and a multitude of schemes that achieve various levels of security and efficiency have been proposed. Any practical SSE scheme, however, should (at a minimum) satisfy the following properties: sublinear search time, security against adaptive chosen-keyword attacks, compact indexes and the ability to add and delete files efficiently. Unfortunately, none of the previously-known SSE constructions achieve all these properties at the same time. This severely limits the practical value of SSE and decreases its chance of deployment in real-world cloud storage systems. To address this, we propose the first SSE scheme to satisfy all the properties outlined above. Our construction extends the inverted index approach (Curtmola et al., CCS 2006) in several non-trivial ways and introduces new techniques for the design of SSE. In addition, we implement our scheme and conduct a performance evaluation, showing that our approach is highly efficient and ready for deployment.
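The inverted index approach (Curtmola et al., CCS 2006) that the abstract above extends is, at its core, a keyword-to-document-list map. The sketch below shows only the plaintext structure with invented documents; an SSE scheme stores an encrypted analogue in which both the keywords and the posting lists are hidden from the server.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Plaintext inverted index: keyword -> sorted list of document ids.
    A dynamic SSE scheme must additionally support adding and deleting
    files, i.e. updating these posting lists under encryption."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for word in sorted(set(text.split())):
            index[word].append(doc_id)
    return index

docs = {
    "d1": "secure cloud search",
    "d2": "cloud storage",
}
idx = build_inverted_index(docs)
```

Search over this structure is a single lookup per keyword, which is what makes sublinear search time achievable once the structure is encrypted.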