1045-9219 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TPDS.2015.2425407, IEEE Transactions on Parallel and Distributed Systems
An Efficient Privacy-Preserving Ranked
Keyword Search Method
Chi Chen, Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student
Member, IEEE, J. Hu, Member, IEEE, S. Guo, Senior Member, IEEE, Z. Tari, Senior Member, IEEE,
and Albert Y. Zomaya, Fellow, IEEE
Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation.
Therefore it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship
between documents is normally concealed in the process of encryption, which leads to significant degradation of search
accuracy. Also, the volume of data in data centers has experienced dramatic growth. This makes it even more challenging to
design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of
encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the
demand for fast ciphertext search in a big data environment. The proposed hierarchical approach clusters the documents based
on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the
maximum cluster size is reached. In the search phase, this approach achieves a linear computational complexity against an
exponential increase in the size of the document collection. In order to verify the authenticity of search results, a structure called
the minimum hash sub-tree is designed in this paper. Experiments have been conducted using a collection set built from
IEEE Xplore. The results show that with a sharp increase of documents in the dataset, the search time of the proposed method
increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method
has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.
Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, big data,
security
1 INTRODUCTION
As we step into the big data era, terabytes of data are produced worldwide every day. Enterprises and users who own a large amount of data usually choose to outsource their precious data to a cloud facility in order to reduce data management costs and storage spending. As a result, the data volume in cloud storage facilities is experiencing a dramatic increase. Although cloud service providers (CSPs) claim that their cloud services are armed with strong security measures, security and privacy remain major obstacles preventing the wider acceptance of cloud computing services [1].

An early version of this paper was presented at the Workshop on BigSecurity with IEEE INFOCOM 2014 [28]. Extensive enhancements have been made, including a novel verification scheme that helps the data user verify the authenticity of the search results, a security analysis, and more details of the proposed scheme. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010701) and the National High Technology Research and Development Program of China (No. 2013AA01A24).

Chi Chen is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: chenchi@iie.ac.cn).
Xiaojie Zhu is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: zhuxiaojie@iie.ac.cn).
Peisong Shen is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: shenpeisong@iie.ac.cn).
J. Hu is with the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, ACT 2600, Australia (e-mail: J.Hu@adfa.edu.au).
Song Guo is with the School of Computer Science and Engineering, The University of Aizu, Japan (e-mail: sguo@u-aizu.ac.jp).
Zahir Tari is with the School of Computer Science, RMIT University, Australia (e-mail: zahir.tari@rmit.edu.au).
Albert Zomaya is with the School of Information Technologies, The University of Sydney, Australia (e-mail: albert.zomaya@sydney.edu.au).
A traditional way to reduce information leakage is data encryption. However, this makes server-side data utilization, such as searching on encrypted data, a very challenging task. In recent years, researchers have proposed many ciphertext search schemes [35-38][43] that incorporate cryptographic techniques. These methods come with provable security, but they require massive computation and have high time complexity. Therefore, they are not suitable for the big data scenario, where the data volume is huge and applications require online data processing. In addition, the relationship between documents is concealed in the above methods. The relationship between documents represents their properties, and hence maintaining it is vital to fully express a document. For example, the relationship can be used to express a document's category. If a document is independent of all other documents except those related to sports, then it is easy to assert that this document belongs to the sports category. Due
to the blind encryption, this important property has been concealed in traditional methods. Therefore, it is desirable to have a method that can maintain and utilize this relationship to speed up the search phase.
On the other hand, due to software/hardware failures and storage corruption, the search results returned to users may contain damaged data or may have been distorted by a malicious administrator or intruder. Thus, a verifiable mechanism should be provided for users to verify the correctness and completeness of the search results.
In this paper, a vector space model is used and every document is represented by a vector, which means every document can be seen as a point in a high-dimensional space. Due to the relationships between different documents, all the documents can be divided into several categories. In other words, points that are close to each other in the high-dimensional space can be classified into a specific category. The search time can be largely reduced by selecting the desired category and abandoning the irrelevant ones. Compared with the whole dataset, the number of documents that a user aims at is very small. Because the number of desired documents is small, a specific category can be further divided into several sub-categories. Instead of using the traditional sequential search method, a backtracking algorithm is proposed to search for the target documents. The cloud server first searches the categories and gets the minimum desired sub-category. Then the cloud server selects the desired k documents from the minimum desired sub-category. The value of k is decided by the user in advance and sent to the cloud server. If the current sub-category cannot supply k documents, the cloud server traces back to its parent and selects the desired documents from the sibling categories. This process is executed recursively until the desired k documents are obtained or the root is reached. To verify the integrity of the search result, a verifiable structure based on hash functions is constructed. Every document is hashed, and the hash result is used to represent the document. The hash results of the documents are hashed again together with the information of the category that these documents belong to, and the result is used to represent the current category. Similarly, every category is represented by the hash of the combination of the current category information and its sub-categories' information. A virtual root is constructed to represent all the data and categories; it is denoted by the hash of the concatenation of all the categories located in the first level. The virtual root is signed so that it is verifiable. To verify the search result, the user only needs to verify the virtual root instead of verifying every document.
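The hash-then-sign structure described above can be sketched as follows. This is a hedged illustration only: the grouping, labels, and `category_hash` helper are illustrative and not the paper's exact construction.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest used as a node's representative."""
    return hashlib.sha256(data).digest()

def category_hash(child_hashes, category_info: bytes) -> bytes:
    # A category is the hash of its children's concatenated hashes
    # combined with its own category information.
    return h(b"".join(child_hashes) + category_info)

# Leaf level: every document is hashed individually.
docs = [b"doc about football", b"doc about phones", b"doc about clouds"]
doc_hashes = [h(d) for d in docs]

# First-level categories (illustrative grouping).
cat_sports = category_hash(doc_hashes[:1], b"category:sports")
cat_tech = category_hash(doc_hashes[1:], b"category:tech")

# The virtual root hashes the concatenation of all first-level
# categories; signing only this value lets the user verify any
# returned document against a single signature.
virtual_root = h(cat_sports + cat_tech)
```

To check a returned document, the user recomputes the document's hash and the hashes along its path, then compares the recomputed virtual root with the signed one; any tampering changes the root.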
2 EXISTING SOLUTIONS
In recent years, searchable encryption, which provides a text search function over encrypted data, has been widely studied, especially regarding security definitions, formalizations, and efficiency improvements, e.g. [2-7]. As shown in Fig. 1, the proposed method is compared with existing solutions and has an advantage in maintaining the relationship between documents.
2.1 Single Keyword Searchable Encryption
Song et al. [8] first introduced the notion of searchable encryption. They proposed to encrypt each word in the document independently. This method has a high search cost due to scanning the whole data collection word by word. Goh et al. [9] formally defined a secure index structure and formulated a security model for indexes known as semantic security against adaptive chosen keyword attack (IND-CKA). They also developed an efficient IND-CKA secure index construction called Z-IDX using pseudo-random functions and Bloom filters. Cash et al. [42] recently designed and implemented an efficient data structure. Due to the lack of a ranking mechanism, users have to take a long time to select what they want when massive documents contain the query keyword. Thus, order-preserving techniques are utilized to realize the ranking mechanism, e.g. [10-12]. Wang et al. [13] use an encrypted inverted index to achieve secure ranked keyword search over encrypted documents. In the search phase, the cloud server computes the relevance score between the documents and the query. In this way, relevant documents are ranked according to their relevance scores and users can get the top-k results. In the public key setting, Boneh et al. [3] designed the first searchable encryption construction, where anyone can use the public key to write to the data stored on the server but only authorized users owning the private key can search. However, all the above-mentioned techniques only support single keyword search.
2.2 Multiple Keyword Searchable Encryption
To enrich search predicates, a variety of conjunctive keyword search methods (e.g. [7, 14-17]) have been proposed. These methods introduce large overhead, such as communication cost from secret sharing, e.g. [15], or computational cost from bilinear maps, e.g. [7]. Pang et al. [18] propose a secure search scheme based on the vector space model. Due to the lack of a security analysis for frequency information and of practical search performance, it is unclear whether their scheme is secure and efficient. Cao et al. [19] present a novel architecture to solve the problem of multi-keyword ranked search over encrypted cloud data. However, the search time of this method grows exponentially with the exponentially increasing size of the document collection. Sun et al. [20] give a new
architecture which achieves better search efficiency. However, at the index building stage, the relevance between documents is ignored. As a result, since the relevance of plaintexts is concealed by the encryption, users' expectations cannot be fulfilled well. For example, given a query containing "Mobile" and "Phone", only the documents containing both keywords will be retrieved by traditional methods. But if the semantic relationship between the documents is taken into consideration, the documents containing "Cell" and "Phone" should also be retrieved. Obviously, the second result better meets the user's expectation.
2.3 Verifiable Search Based on Authenticated Index
The idea of data verification has been well studied in
the area of databases. In a plaintext database scenario,
a variety of methods have been produced, e.g. [21-23]. Most of these works are based on the original work by Merkle [24, 25] and refinements by Naor and Nissim [26] for certificate revocation. The Merkle hash tree and cryptographic signature techniques are used to construct an authenticated tree structure upon which end users can verify the correctness and completeness of the query results.
Pang et al. [27] apply the Merkle hash tree based authenticated structure to text search engines. However, they only focus on verification-specific issues, ignoring the search privacy preserving capabilities that are addressed in this paper.
A hash chain is used by Wang et al. [10] to construct a single keyword search result verification scheme. Sun et al. [20] use a Merkle hash tree and cryptographic signatures to create a verifiable MDB-tree. However, their work cannot be directly used in our architecture, which is oriented toward privacy-preserving multiple keyword search. Thus, a proper mechanism that can verify the search results in the big data scenario is essential to both CSPs and end users.
3 OUR CONTRIBUTION
In this paper, we propose a multi-keyword ranked search over encrypted data based on a hierarchical clustering index (MRSE-HCI) that maintains the close relationship between different plain documents over the encrypted domain in order to enhance search efficiency. In the proposed architecture, the search time grows linearly with an exponentially growing size of the data collection. We derive this idea from the observation that users' retrieval needs usually concentrate on a specific field, so we can speed up the search process by computing relevance scores only between the query and the documents that belong to the same specific field as the query. As a result, only documents which are classified into the field specified by the user's query will be evaluated to get their relevance scores. Because irrelevant fields are ignored, the search speed is enhanced.
We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve it. According to the proposed clustering method, every document is dynamically classified into a specific cluster which has a constraint on the minimum relevance score between the different documents it contains. The relevance score is a metric used to evaluate the relationship between different documents. When new documents are added to a cluster, the constraint on the cluster may be broken. If one of the new documents breaks the constraint, a new cluster center is added and the current document is chosen as a temporary cluster center. Then all the documents are reassigned and all the cluster centers are re-elected. Therefore, the number of clusters depends on the number of documents in the dataset and the close relationship between different plain documents. In other words, the cluster centers are created dynamically and the number of clusters is decided by the properties of the dataset.
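The dynamic clustering rule above can be sketched as follows. The helper names and the relevance metric (inner product of toy document vectors) are assumptions for illustration, not the paper's exact algorithm.

```python
# A document whose relevance to every existing center falls below the
# minimum-relevance constraint seeds a temporary new cluster center.
MIN_RELEVANCE = 0.5   # assumed constraint value

def relevance(u, v):
    return sum(a * b for a, b in zip(u, v))

def assign(docs, centers):
    """Assign each document vector to its most relevant center; a
    document that breaks the constraint becomes a new center."""
    clusters = {i: [] for i in range(len(centers))}
    for d in docs:
        scores = [relevance(d, c) for c in centers]
        best = max(range(len(centers)), key=lambda i: scores[i])
        if scores[best] < MIN_RELEVANCE:
            centers.append(d)                 # temporary new cluster center
            clusters[len(centers) - 1] = [d]
        else:
            clusters[best].append(d)
    return clusters

def recenter(clusters):
    """Re-elect each center as the average of its members' vectors."""
    return [[sum(x) / len(ms) for x in zip(*ms)]
            for ms in clusters.values() if ms]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = assign(docs, centers=[[1.0, 0.0]])
centers = recenter(clusters)   # two clusters emerge from one seed center
```

In the full scheme, assignment and re-election would be iterated until the clustering stabilizes; a single pass is shown here for brevity.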
We propose a hierarchical method in order to get a better clustering result within a large data collection. The size of each cluster is controlled as a trade-off between clustering accuracy and query efficiency. According to the proposed method, the number of clusters and the minimum relevance score increase with the level, whereas the maximum size of a cluster decreases. Depending on the needs of the grain level, the maximum size of a cluster is set at each level. Every cluster needs to satisfy these constraints. If a cluster's size exceeds the limitation, the cluster is divided into several sub-clusters.
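The level-by-level size constraint can be sketched as a recursive split. Here `split_into` is a naive stand-in for re-clustering an oversized cluster into sub-clusters, and the maximum size is a toy value.

```python
MAX_SIZE = 2   # assumed per-level maximum cluster size

def split_into(cluster, k):
    """Toy splitter: chop a cluster into k roughly equal sub-clusters."""
    step = (len(cluster) + k - 1) // k
    return [cluster[i:i + step] for i in range(0, len(cluster), step)]

def build_hierarchy(cluster, max_size=MAX_SIZE):
    """Recursively partition until every leaf satisfies the constraint."""
    if len(cluster) <= max_size:
        return cluster                     # small enough: a leaf cluster
    return [build_hierarchy(s, max_size) for s in split_into(cluster, 2)]

tree = build_hierarchy(["d1", "d2", "d3", "d4", "d5"])
# → [[["d1", "d2"], ["d3"]], ["d4", "d5"]]
```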
We design a search strategy to improve rank privacy. In the search phase, the cloud server first computes the relevance scores between the query and the cluster centers of the first level, and then chooses the nearest cluster. This process is iterated to get the nearest child cluster until the smallest cluster has been found. The cloud server then computes the relevance scores between the query and the documents included in the smallest cluster. If the smallest cluster cannot satisfy the number of desired documents, which is decided by the user in advance, the cloud server traces back to the parent of the smallest cluster and searches its sibling clusters. This process is iterated until the number of desired documents is satisfied or the root is reached. Due to this special search procedure, the rankings of documents in the search results differ from the rankings derived from a traditional sequential search. Therefore, rank privacy is enhanced.
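The descend-then-backtrack strategy can be sketched as below. The node layout (dicts with `"children"`, leaf lists of `(doc_id, vector)` pairs) and all helper names are illustrative assumptions, not the paper's data structures.

```python
def relevance(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf_docs(node):
    """All (doc_id, vector) pairs under a node."""
    if isinstance(node, list):
        return node
    out = []
    for c in node["children"]:
        out.extend(leaf_docs(c))
    return out

def center_of(node):
    """Average vector of a node's contents (a stand-in for stored centers)."""
    if isinstance(node, list):
        vecs = [v for _, v in node]
    else:
        vecs = [center_of(c) for c in node["children"]]
    return [sum(x) / len(vecs) for x in zip(*vecs)]

def search(node, query, k, path=None):
    path = [] if path is None else path
    if isinstance(node, list):               # reached the smallest cluster
        results = list(node)
        # Backtrack: absorb sibling subtrees until k documents are found.
        for parent, taken in reversed(path):
            if len(results) >= k:
                break
            for sib in parent["children"]:
                if sib is not taken:
                    results.extend(leaf_docs(sib))
        results.sort(key=lambda d: relevance(d[1], query), reverse=True)
        return results[:k]
    best = max(node["children"],
               key=lambda c: relevance(center_of(c), query))
    return search(best, query, k, path + [(node, best)])

# Toy two-level hierarchy; leaves hold (doc_id, vector) pairs.
leaf_a = [("a", [1.0, 0.0])]
leaf_b = [("b", [0.8, 0.2])]
leaf_c = [("c", [0.0, 1.0])]
root = {"children": [{"children": [leaf_a, leaf_b]}, leaf_c]}
top2 = search(root, query=[1.0, 0.0], k=2)
# leaf_a alone cannot supply k=2, so the sibling leaf_b is absorbed.
```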
Part of the above work was presented in [28]. As a further improvement, we also construct
a verifiable tree structure upon the hierarchical clustering method to verify the integrity of the search result in this paper. This authenticated tree structure mainly takes advantage of the Merkle hash tree and cryptographic signatures. Every document is hashed, and the hash result is used as the representative of the document. The smallest cluster is represented by the hash of the concatenation of the hashes of the documents it contains, combined with its own category information. A parent cluster is represented by the hash of the concatenation of its children's hashes, combined with its own category information. A virtual root is added and represented by the hash of the concatenation of the categories located in the first level. In addition, the virtual root is signed so that the user can verify the search result by verifying only the virtual root.
In short, our contributions can be summarized as
follows:
1) We investigate the problem of maintaining the
close relationship between different plain docu-
ments over an encrypted domain and propose a
clustering method to solve this problem.
2) We propose the MRSE-HCI architecture to speed up the server-side search phase. As the document collection grows exponentially, the search time grows only linearly instead of exponentially.
3) We design a search strategy to improve rank privacy. This search strategy adopts the backtracking algorithm upon the above clustering method. As the data volume grows, the advantage of the proposed method in rank privacy becomes more apparent.
4) By applying the Merkle hash tree and cryptographic signatures to an authenticated tree structure, we provide a verification mechanism to assure the correctness and completeness of the search results.
The rest of the paper is organized as follows: Section IV describes the system model, threat model, design goals, and notations. The architecture and detailed algorithms are presented in Section V. We discuss the efficiency and security of the MRSE-HCI scheme in Section VI. An evaluation method is provided in Section VII. Section VIII demonstrates the results of our experiments. Section IX concludes the paper.
4 DEFINITION AND BACKGROUND
4.1 System Model
The system model contains three entities, as illustrated in Fig. 1: the data owner, the data user, and the cloud server. The box with dashed lines in the figure indicates the component added to the existing architecture.
Fig. 1 Architecture of ciphertext search
The data owner is responsible for collecting documents, building the document index, and outsourcing them in an encrypted format to the cloud server. Apart from that, the data user needs to get authorization from the data owner before accessing the data. The cloud server provides a huge storage space and the computation resources needed by ciphertext search. Upon receiving a legal request from the data user, the cloud server searches the encrypted index and sends back the top-k documents that are most likely to match the user's query [12]. The number k is properly chosen by the data user. Our system aims at protecting data from leaking information to the cloud server while improving the efficiency of ciphertext search. In this model, both the data owner and the data user are trusted, while the cloud server is semi-trusted, which is consistent with the architecture in [10, 19, 29]. In other words, the cloud server will strictly follow the prescribed protocol but will try to get more information about the data and the index.
4.2 Threat Model
The adversary's ability can be summarized in two threat models.
Known Ciphertext Model
In this model, the cloud server can only get the encrypted document collection, the encrypted data index, and the encrypted query keywords.
Known Background Model
In this model, the cloud server knows more information than in the known ciphertext model. Statistical background information about the dataset, such as the document frequency and term frequency of a specific keyword, can be used by the cloud server to launch a statistical attack to infer or identify specific keywords in the query [10, 11], which further reveals the plaintext content of documents.
4.3 Design Goals
Search efficiency. The time complexity of the search in the MRSE-HCI scheme needs to be logarithmic in the size of the data collection in order to deal with the explosive growth of document volume in the big data scenario.
Retrieval accuracy. Retrieval precision is related to two factors: the relevance between the query and the documents in the result set, and the relevance among the documents in the result set.
Integrity of the search result. The integrity of the
search results includes three aspects:
1) Correctness. All the documents returned
from servers are originally uploaded by the
data owner and remain unmodified.
2) Completeness. No qualified documents are
omitted from the search results.
3) Freshness. The returned documents are the
latest version of documents in the dataset.
Privacy requirements. We set a series of privacy requirements on which current researchers mostly focus.
1) Data privacy. Data privacy denotes the confidentiality and privacy of documents. The adversary cannot get the plaintext of documents stored on the cloud server if data privacy is guaranteed. Symmetric cryptography is a conventional way to achieve data privacy.
2) Index privacy. Index privacy means the ability to frustrate the adversary's attempts to steal the information stored in the index. Such information includes keywords, the TF (term frequency) of keywords in documents, the topic of documents, and so on.
3) Keyword privacy. It is important to protect users' query keywords. A secure query generation algorithm should output trapdoors that leak no information about the query keywords.
4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from a query is different, even for the same query. This can be realized by integrating a random function into the trapdoor generation process. If the adversary can deduce that a certain set of trapdoors all correspond to the same keyword, he can calculate the frequency of this keyword in search requests over a certain period. Combined with the document frequency of the keyword in the known background model, he/she can use a statistical attack to identify the plain keyword behind these trapdoors.
5) Rank privacy. The rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank orders of different search results and further identify the search keyword.
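One common way to realize trapdoor unlinkability, sketched below with illustrative helper names rather than the paper's matrix-based encryption, is to blind the query vector with a fresh random positive factor: every submission yields a different trapdoor, while the relative order of inner-product relevance scores, and hence the ranking, is preserved.

```python
import random

def make_trapdoor(query_vec, rng=random):
    """Blind the query with a fresh random factor r > 0 so identical
    queries produce distinct trapdoors (toy sketch only)."""
    r = rng.uniform(0.5, 2.0)
    return [r * x for x in query_vec]

def score(trapdoor, doc_vec):
    return sum(a * b for a, b in zip(trapdoor, doc_vec))

q = [1.0, 0.0, 1.0]
t1, t2 = make_trapdoor(q), make_trapdoor(q)

docs = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.1]]
rank1 = sorted(range(len(docs)), key=lambda i: score(t1, docs[i]), reverse=True)
rank2 = sorted(range(len(docs)), key=lambda i: score(t2, docs[i]), reverse=True)
# The two trapdoors differ, yet both rankings put document 0 first.
```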
4.4 Notations
In this paper, the notations presented in Table 1 are used.
TABLE 1
Notation

d_i    The ith document vector, denoted as d_i = {d_{i,1}, ..., d_{i,n}}, where d_{i,j} represents whether the jth keyword in the dictionary appears in document d_i.
m      The number of documents in the data collection.
n      The size of the dictionary DW.
CCV    The collection of cluster center vectors, denoted as CCV = {c_1, ..., c_n}, where c_i is the average vector of all document vectors in the cluster.
CCV_i  The collection of the ith-level cluster center vectors, denoted as CCV_i = {v_{i,1}, ..., v_{i,n}}, where v_{i,j} represents the jth vector in the ith level.
DC     The information of document classification, such as the document id list of a certain cluster.
DV     The collection of document vectors, denoted as DV = {d_1, d_2, ..., d_m}.
DW     The dictionary, denoted as DW = {w_1, w_2, ..., w_n}.
F_w    The ranked id list of all documents according to their relevance to keyword w.
I_c    The clustering index, which contains the encrypted vectors of cluster centers.
I_d    The traditional index, which contains encrypted document vectors.
L_i    The minimum relevance score between different documents in the ith level of a cluster.
QV     The query vector.
TH     The fixed maximum number of documents in a cluster.
T_w    The encrypted query vector (trapdoor) for the user's query.
5 ARCHITECTURE AND ALGORITHM
5.1 System Model
In this section, we introduce the MRSE-HCI scheme. The vector space model adopted by MRSE-HCI is the same as in MRSE [19], while the process of building the index is totally different. A hierarchical index structure is introduced into MRSE-HCI instead of a sequential index. In MRSE-HCI, every document is indexed by a vector. Every dimension of the vector stands for a keyword, and the value represents whether the keyword appears in the document. Similarly, the query is also represented by a vector. In the search phase, the cloud server calculates the relevance score between the query and the documents by computing the inner product of the query vector and the document vectors, and returns the target documents to the user according to the top-k relevance scores.
Because all the documents outsourced to the cloud server are encrypted, the semantic relationship between the plain documents is lost in the encrypted domain. In order to maintain this relationship, a clustering method is used to cluster the documents by clustering their related index vectors. Every document vector is viewed as a point in n-dimensional space. With the lengths of the vectors normalized, the distance between points in the n-dimensional space reflects the relevance of the corresponding documents. In other words, the points of highly relevant documents are very close to each other in the n-dimensional space. As a result, we can cluster
the documents based on the distance measure.
Fig. 2 MRSE-HCI architecture
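The claim that distance reflects relevance for normalized vectors admits a one-line check: for unit vectors u and v, ||u − v||² = 2 − 2⟨u, v⟩, so a higher inner-product relevance score means a strictly smaller Euclidean distance. A minimal sketch with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit vectors, ||u - v||^2 = 2 - 2<u, v>: clustering by distance
# therefore groups exactly the documents with high mutual relevance.
u = normalize([3.0, 1.0])
v = normalize([2.0, 1.0])
assert abs(dist(u, v) ** 2 - (2 - 2 * inner(u, v))) < 1e-9
```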
As the volume of data in data centers has experienced dramatic growth, the conventional sequential search approach becomes very inefficient. To improve search efficiency, a hierarchical clustering method is proposed. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold at different levels, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is reached. Upon receiving a legal request, the cloud server searches the related indexes layer by layer instead of scanning all indexes.
5.2 MRSE-HCI Architecture
The MRSE-HCI architecture is depicted in Fig. 2: the data owner builds the encrypted index from the dictionary, random numbers, and a secret key; the data user submits a query to the cloud server to retrieve the desired documents; and the cloud server returns the target documents to the data user. The architecture mainly consists of the following algorithms.
Keygen(1^l(n)) → (sk, k): generates the secret keys used to encrypt the index and the documents.
Index(D, sk) → I: generates the encrypted index using the above-mentioned secret key; the clustering process is also carried out in this phase.
Enc(D, k) → E: encrypts the document collection with a symmetric encryption algorithm that achieves semantic security.
Trapdoor(w, sk) → T_w: generates the encrypted query vector T_w from the user's input keywords and the secret key.
Search(T_w, I, k_top) → (I_w, E_w): the cloud server compares the trapdoor with the index to obtain the top-k retrieval results.
Dec(E_w, k) → F_w: decrypts the returned encrypted documents with the key generated in the first step.
The concrete functions of the different components are described below.
1) Keygen(1^l(n)): The data owner randomly generates an (n+u+1)-bit vector S, in which every element is either 0 or 1, and two invertible (n+u+1)×(n+u+1) matrices whose elements are random integers; together these constitute the secret key sk. The secret key k is generated by the data owner choosing an n-bit pseudo-random sequence.
2) Index(D, sk): As shown in Fig. 3, the data owner uses a tokenizer and parser to analyze every document and extract all keywords. The data owner then uses the dictionary D_w to transform the documents into a collection of document vectors DV. Next, the data owner computes DC and CCV by using a quality hierarchical clustering (QHC) method, which is illustrated in Section 5.4. After that, the data owner applies the dimension-expanding and vector-splitting procedure to every document vector. It is worth noting that CCV is treated the same way as DV. For dimension-expanding, every vector in DV is extended to (n+u+1) dimensions, where the value in dimension n+j (0 ≤ j ≤ u) is a randomly generated integer and the last dimension is set to 1. For vector-splitting, every extended document vector is split into two (n+u+1)-dimensional vectors, V' and V'', with the (n+u+1)-bit vector S serving as a splitting indicator. If the ith element of S (S_i) is 0, then we set V''_i = V'_i = V_i; if S_i is 1, then V''_i is set to a random number and V'_i = V_i − V''_i. Finally, the traditional index I_d is encrypted as I_d = {M_1^T V', M_2^T V''} by matrix multiplication with sk, and I_c is generated in a similar way. After this, I_d, I_c, and DC are outsourced to the cloud server.
3) Enc(D, k): The data owner adopts a secure symmetric encryption algorithm (e.g., AES) to encrypt the plain document set D and outsources it to the cloud server.
4) Trapdoor(w, sk): The data user sends the query to the data owner, who analyzes the query and builds the query vector QV from the query keywords with the help of the dictionary D_w. QV is then extended to an (n+u+1)-dimensional query vector: v random positions chosen from the range (n, n+u] are set to 1, the other expanded positions are set to 0, and the value of the last dimension is set to a random number t ∈ [0, 1]. Then the first (n+u) dimensions of QV, denoted q_w, are scaled by a random number r (r ≠ 0), giving Q_w = (r·q_w, t). After that, Q_w is split into two random vectors {Q'_w, Q''_w} by a vector-splitting procedure similar to that in the Index(D, sk) phase. The difference is that if the ith bit of S is 1, then q'_i = q''_i = q_i; if the ith bit of S is 0, then q'_i is set to a random number and q''_i = q_i − q'_i. Finally, the encrypted query vector T_w is generated as T_w = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} and sent back to the data user.
5) Search(T_w, I, k_top): Upon receiving T_w from the data user, the cloud server computes the relevance score between T_w and the cluster index I_c and chooses the matched cluster with the highest relevance score. For every document contained in the matched cluster, the cloud server extracts its corresponding encrypted document vector in I_d and calculates its relevance score S with T_w, as described in Equation 1. Finally, the scores of the documents in the matched cluster are sorted and the top k_top documents are returned by the cloud server. The details are discussed in Section 5.5.
Fig. 3 Algorithm Index
Fig. 4 Algorithm Dynamic k-means
S = T_w · I_c
  = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} · {M_1^T V', M_2^T V''}
  = Q'_w · V' + Q''_w · V''
  = Q_w · V    (1)
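The correctness of Equation 1 — splitting with S, encrypting with {M_1, M_2}, and recovering Q_w · V as the score — can be checked numerically with a small self-contained sketch. This is an illustrative toy (tiny dimension, exact rational arithmetic, random matrices regenerated until invertible, no dimension expansion or scaling step), not the authors' implementation:

```python
from fractions import Fraction
from random import Random

rng = Random(7)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def invert(M):
    # Gauss-Jordan elimination over exact fractions; raises if M is singular.
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

def random_invertible(d):
    while True:
        M = [[rng.randint(-3, 3) for _ in range(d)] for _ in range(d)]
        try:
            return M, invert(M)
        except StopIteration:   # singular matrix: retry
            pass

def split(vec, S, split_bit):
    # Positions where S[i] == split_bit become two random shares that sum
    # to vec[i]; all other positions are copied into both halves.
    v1, v2 = [], []
    for s, x in zip(S, vec):
        if s == split_bit:
            r = Fraction(rng.randint(-5, 5))
            v1.append(r)
            v2.append(x - r)
        else:
            v1.append(x)
            v2.append(x)
    return v1, v2

d = 4                                        # toy stand-in for n + u + 1
S = [rng.randint(0, 1) for _ in range(d)]    # splitting indicator
M1, M1_inv = random_invertible(d)
M2, M2_inv = random_invertible(d)

V = [Fraction(rng.randint(0, 1)) for _ in range(d)]   # document vector
Q = [Fraction(rng.randint(0, 1)) for _ in range(d)]   # query vector

V1, V2 = split(V, S, split_bit=1)            # index split rule: S_i = 1 -> share
Q1, Q2 = split(Q, S, split_bit=0)            # query split rule: S_i = 0 -> share
enc_index = (mat_vec(transpose(M1), V1), mat_vec(transpose(M2), V2))
trapdoor = (mat_vec(M1_inv, Q1), mat_vec(M2_inv, Q2))

score = dot(trapdoor[0], enc_index[0]) + dot(trapdoor[1], enc_index[1])
assert score == dot(Q, V)                    # Equation 1 holds exactly
```

The complementary split rules are what make the cross terms cancel: wherever one side is shared, the other side carries identical copies, so each coordinate contributes exactly Q_i·V_i.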
6) Dec(E_w, k): The data user utilizes the secret key k to decrypt the returned ciphertext E_w.
5.3 Relevance Measure
In this paper, the concept of coordinate matching [30] is adopted as the relevance measure. It is used to quantify the document-query and document-document relevance, as well as the relevance between the query and the cluster centers. Equation 2 defines the relevance score between document d_i and query q_w. Equation 3 defines the relevance score between query q_w and cluster center lc_{i,j}. Equation 4 defines the relevance score between documents d_i and d_j.
S_{q,d_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × d_{i,t})    (2)

S_{q,c_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × lc_{i,j,t})    (3)

S_{d_i,d_j} = Σ_{t=1}^{n+u+1} (d_{i,t} × d_{j,t})    (4)

Fig. 5 Algorithm Quality Hierarchical Clustering (QHC)
Fig. 6 Clustering Process
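Equations 2-4 are all the same operation, an inner product over the (n+u+1)-dimensional vectors, so a single helper covers all three cases. The vectors below are made-up toy values:

```python
def relevance(a, b):
    # coordinate matching (Equations 2-4): inner product of two vectors
    return sum(x * y for x, y in zip(a, b))

q = [1, 0, 1, 1]    # toy query vector
d1 = [1, 1, 1, 0]   # toy document vectors
d2 = [0, 0, 1, 1]
assert relevance(q, d1) == 2    # query-document score (Equation 2)
assert relevance(d1, d2) == 1   # document-document score (Equation 4)
```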
5.4 Quality Hierarchical Clustering Algorithm
Many hierarchical clustering methods have been proposed so far. However, none of them is comparable to partition clustering methods in terms of time complexity. K-means [31] and K-medoids [32] are popular partition clustering algorithms, but in both of them k is fixed, so they cannot be applied to situations with a dynamic number of cluster centers. We propose a quality hierarchical clustering (QHC) algorithm based on a novel dynamic K-means.
As shown in Fig. 4, the proposed dynamic K-means algorithm defines a minimum relevance threshold to keep each cluster compact and dense. If the relevance score between a document and its center is smaller than the threshold, a new cluster center is added and all the documents are reassigned. This procedure is iterated until k is stable. In contrast with traditional clustering methods, k changes dynamically during the clustering process, which is why the algorithm is called dynamic K-means.
The QHC algorithm, illustrated in Fig. 5, works as follows. Every cluster is checked to see whether its size exceeds the maximum number TH. If it does, this "big" cluster is split into child clusters, which are formed by running dynamic K-means on the documents of that cluster. This procedure is iterated until all clusters meet the maximum-cluster-size requirement. The clustering procedure is illustrated in Fig. 6. All the documents
are denoted as points in a coordinate system. These points are initially partitioned into two clusters by running the dynamic K-means algorithm with k = 2; the two resulting big clusters are depicted by the elliptical shapes. The two clusters are then checked to see whether their points satisfy the distance constraint. The second cluster does not meet this requirement, so a new cluster center is added, k becomes 3, and the dynamic K-means algorithm runs again to partition the second cluster into two parts. The data owner then checks whether the size of each cluster exceeds the maximum number TH. Cluster 1 is split into two sub-clusters because of its size. Finally, all points are clustered into four clusters, as depicted by the rectangles.
Fig. 7 Retrieval Process
Fig. 8 Algorithm Building-minimum hash sub-tree
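The dynamic K-means loop and the QHC splitting loop above can be sketched as follows. This is an illustrative reading of Figs. 4-6, not the authors' code: the relevance threshold, TH, and the toy document vectors are invented, and centers are updated by simple averaging:

```python
def relevance(a, b):
    # inner product; with normalized vectors this reflects closeness
    return sum(x * y for x, y in zip(a, b))

def mean(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def dynamic_kmeans(docs, threshold, k=2, max_rounds=50):
    """Partition docs; add a new center whenever some document's relevance
    to its nearest center falls below `threshold` (so k grows dynamically)."""
    centers = [list(docs[i]) for i in range(min(k, len(docs)))]
    clusters = [list(docs)]
    for _ in range(max_rounds):
        clusters = [[] for _ in centers]
        worst = None
        for d in docs:
            scores = [relevance(d, c) for c in centers]
            best = max(range(len(centers)), key=scores.__getitem__)
            clusters[best].append(d)
            if scores[best] < threshold and (worst is None or scores[best] < worst[0]):
                worst = (scores[best], d)
        if worst is not None and len(centers) < len(docs):
            centers.append(list(worst[1]))   # add a center, reassign next round
            continue
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:           # k and the centers are stable
            break
        centers = new_centers
    return [c for c in clusters if c]

def qhc(docs, threshold, max_size):
    """Recursively split every cluster whose size exceeds max_size (TH)."""
    if len(docs) <= max_size:
        return [docs]
    out = []
    for cluster in dynamic_kmeans(docs, threshold):
        if len(cluster) < len(docs):
            out.extend(qhc(cluster, threshold, max_size))
        else:               # no progress: keep as-is to guarantee termination
            out.append(cluster)
    return out

# toy near-normalized document vectors in three rough groups
docs = [[0.9, 0.1, 0.0], [0.95, 0.05, 0.0], [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.05, 0.0, 0.95]]
clusters = qhc(docs, threshold=0.5, max_size=3)
assert sum(len(c) for c in clusters) == len(docs)   # every document assigned once
```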
5.5 Search Algorithm
The cloud server needs to find the cluster that best matches the query. With the help of the cluster index I_c and the document classification DC, the cloud server uses an iterative procedure to find the best-matched cluster, as the following steps demonstrate:
1) The cloud server computes the relevance score between the query T_w and the encrypted vectors of the first-level cluster centers in the cluster index I_c, then chooses the ith cluster center I_{c,1,i} with the highest score.
2) The cloud server gets the child cluster centers of that cluster center, computes the relevance score between T_w and every encrypted child cluster center vector, and selects the cluster center I_{c,2,i} with the highest score. This procedure is iterated until the ultimate cluster center I_{c,l,j} in the last level l is reached.
In the situation depicted in Fig. 7, there are 9 documents grouped into 3 clusters. After the relevance scores with the trapdoor T_w are calculated, cluster 1, shown within the dashed box in Fig. 7, is found to be the best match. Documents d_1, d_3, and d_9 belong to cluster 1, so their encrypted document vectors in I_d are extracted to compute the relevance scores with T_w.
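The layer-by-layer descent can be sketched as follows. The tiny two-level index loosely mirrors the Fig. 7 example; the vectors, document ids, and dictionary layout are invented for illustration:

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(node, query, k):
    # descend level by level, always following the best-matching center,
    # then rank only the documents in the final matched cluster
    while "children" in node:
        node = max(node["children"], key=lambda c: relevance(query, c["center"]))
    scored = sorted(node["docs"], key=lambda d: relevance(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {"children": [
    {"center": [1, 0],
     "docs": [("d1", [0.9, 0.1]), ("d3", [1, 0]), ("d9", [0.8, 0.2])]},
    {"center": [0, 1],
     "docs": [("d2", [0, 1]), ("d4", [0.1, 0.9])]},
]}
assert search(index, [1, 0], 2) == ["d3", "d1"]   # cluster 1 wins, then top-2
```

Only the documents inside the matched cluster are scored, which is exactly why the per-query cost depends on the cluster size c and tree depth l rather than on the total number of documents m.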
5.6 Search Result Verification
The retrieved data may well be wrong, since the network is unstable and the data may be damaged by hardware/software failure, a malicious administrator, or an intruder. Verifying the authenticity of search results is therefore emerging as a critical issue in the cloud environment. We designed a signed hash tree to verify the correctness and freshness of the search results.
Fig. 9 Algorithm Processing-minimum hash sub-tree
Building. The data owner builds the hash tree based on the hierarchical index structure; the algorithm is shown in Fig. 8. The hash value of a leaf node of the tree is h(id ‖ version ‖ Φ(id)), where id is the document id, version is the document version, and Φ(id) is the document contents. The value of a non-leaf node is a pair (id, h(id ‖ h_child)), where id denotes the value of the cluster center or document vector in the encrypted index, and h_child is the hash value of its child nodes. The hash value of the tree's root node is based on the hash values of all clusters in the first level; it is worth noting that the root node denotes the data set containing all clusters. The data owner then generates a signature over the hash value of the root node and outsources the hash tree, including the root signature, to the cloud server. A cryptographic signature σ (e.g., an RSA or DSA signature) can be used here to authenticate the hash value of the root node.
Processing. Using the algorithm shown in Fig. 9, the cloud server returns the root signature and the minimum hash sub-tree (MHST) to the client. The minimum hash sub-tree includes the hash values of the leaf nodes in the matched cluster and of the non-leaf nodes corresponding to all cluster centers used to find the matched cluster in the searching phase. For example, in Fig. 10 the search result consists of documents D, E, and F. The leaf nodes are then D, E, F, and G, and the non-leaf nodes include C_1, C_2, C_3, C_4, d_D, d_E, d_F, and d_G. In addition, the root is included among the non-leaf nodes.
Verifying. The data user uses the minimum hash sub-tree to re-compute the hash values of the nodes, in particular the root node, which can be further verified against the root signature. If all nodes match, correctness and freshness are guaranteed. The data user then re-searches the index constructed from the retrieved values in the MHST. If this search result is the same as the retrieved result, then completeness, correctness, and freshness are all guaranteed.
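The building and verifying steps can be sketched with a flat two-level tree. SHA-256 stands in for h, the signature over the root is omitted, and the node ids, versions, and layout are a simplified hypothetical stand-in for the paper's MHST rather than its exact structure:

```python
import hashlib

def h(*parts):
    # h(a || b || ...): hash of the concatenated (delimited) parts
    return hashlib.sha256(b"|".join(p.encode() for p in parts)).hexdigest()

# leaf: h(id || version || contents); internal node hash covers id + children
docs = {"D": ("v1", "contents of D"), "E": ("v1", "contents of E"),
        "F": ("v1", "contents of F"), "G": ("v1", "contents of G")}

def leaf_hash(doc_id, version, contents):
    return h(doc_id, version, contents)

def node_hash(node_id, child_hashes):
    return h(node_id, *child_hashes)

# build a tiny two-level tree: cluster C4 holds D..G, the root holds C4;
# in the real scheme the root hash would additionally be signed
hC4 = node_hash("C4", [leaf_hash(d, *docs[d]) for d in "DEFG"])
root = node_hash("root", [hC4])

def verify(returned_docs, expected_root):
    # the user recomputes leaf hashes from the returned documents and
    # folds them back up to the root, comparing with the trusted value
    hC4_check = node_hash("C4", [leaf_hash(d, *returned_docs[d]) for d in "DEFG"])
    return node_hash("root", [hC4_check]) == expected_root

assert verify(docs, root)                                    # untampered
assert not verify({**docs, "D": ("v1", "tampered")}, root)   # detected
```

Any change to a document's contents or version flips its leaf hash, which propagates to the root and breaks the comparison, which is the property the MHST relies on.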
As shown in Fig. 10, in the building phase all documents are clustered into 2 big clusters and 4 small clusters, and each big cluster contains 2 small clusters.
Fig. 10 Authentication for hierarchical clustering index
The hash value of leaf node A is h(id_A ‖ version ‖ Φ(id_A)), the value of non-leaf node C_3 is (id_{C3}, h(id_{C3} ‖ h_A ‖ h_B ‖ h_C)), and the value of non-leaf node C_1 is (id_{C1}, h(id_{C1} ‖ h_{C3} ‖ h_{C4})). The other values of leaf and non-leaf nodes are generated similarly. In order to combine all first-level clusters into a tree, the data owner creates a virtual root node with hash value h(h_{C1,2} ‖ h_{C2,2}), where C_{1,2} and C_{2,2} denote the second part of cluster centers 1 and 2, respectively. The data owner then signs the root node, e.g., σ(h(h_{C1,2} ‖ h_{C2,2})) = (h_{C1,2} ‖ h_{C2,2}, e(h(h_{C1,2} ‖ h_{C2,2}))^k, g), and outsources it to the cloud server.
In the processing phase, suppose that cluster C_4 is the matched cluster and the returned top-3 documents are D, E, and F. The minimum hash sub-tree then includes the hash values of the nodes D, E, F, d_D, d_E, d_F, d_G, C_3, C_2, C_1, C_4 and the signed root σ(h(h_{C1,2} ‖ h_{C2,2})).
In the verifying phase, upon receiving the signed root, the data user first checks whether e(h(h_{C1,2} ‖ h_{C2,2}), g)^k = e(sig_k(h(h_{C1,2} ‖ h_{C2,2})), g). If this does not hold, the retrieved hash tree is not authentic; otherwise the returned nodes D, E, F, d_D, d_E, d_F, d_G, C_3, C_2, C_1, C_4 work together to verify each other and reconstruct the hash tree. If all the nodes are authentic, the returned hash tree is authentic. The data user then re-computes the hash values of the leaf nodes D, E, and F using the returned documents. These newly generated hash values are compared with the corresponding returned hash values. If there is no difference, the retrieved documents are correct. Finally, the data user uses the trapdoor to re-search the index constructed from the first part of the retrieved nodes. If this search result is the same as the retrieved result, the search result is complete.
5.7 Dynamic Data Collection
Since documents stored at the server may be deleted or modified, and new documents may be added to the original data collection, a mechanism that supports dynamic data collection is necessary. A naive way to address these problems is to download all documents and the index locally and then update the data collection and index. However, this method incurs a huge cost in bandwidth and local storage space.
To avoid updating the index frequently, we provide a practical strategy to deal with insertion, deletion, and modification operations. Without loss of generality, we use the following examples to illustrate how the strategy works. The data owner reserves many empty entries in the dictionary for new documents. If a new document contains new keywords, the data owner first adds these new keywords to the dictionary and then constructs a document vector based on the new dictionary. The data owner sends the trapdoor generated from the document vector, the encrypted document, and the encrypted document vector to the cloud server. The cloud server finds the closest cluster and puts the encrypted document and encrypted document vector into it.
As every cluster has a constraint on the maximum size, the number of documents in a cluster may exceed the limit after an insertion. In this case, all the encrypted document vectors belonging to the broken cluster are returned to the data owner. After decrypting the retrieved document vectors, the data owner re-builds the sub-index based on the deciphered document vectors. The sub-index is then re-encrypted and re-outsourced to the cloud server.
Upon receiving a deletion order, the cloud server searches for the target document. The cloud server then deletes the document and the corresponding document vector.
Modifying a document can be described as deleting the old version of the document and inserting the new version. The modification operation can therefore be realized by combining the insertion and deletion operations.
To deal with the impact of these operations on the hash tree, a lazy update strategy is designed. For an insertion, the corresponding hash value is calculated and marked as a raw node, while the original nodes in the hash tree are kept unchanged, because the original hash tree still supports verification of every document except the new one. Only when the newly added document is accessed is the hash tree updated. A similar idea applies to deletion; the only difference is that a deletion does not trigger a hash tree update.
6 EFFICIENCY AND SECURITY
6.1 Search Efficiency
The search process can be divided into a Trapdoor(w, sk) phase and a Search(T_w, I, k_top) phase. The number of operations needed in the Trapdoor(w, sk) phase is given by Equation 5, where n is the number of keywords in the dictionary and w is the number of query keywords.

O(MRSE-HCI) = 5n + uvw + 5    (5)

Since the time complexity of the Trapdoor(w, sk) phase is independent of DC, it can be described as O(1) even when DC increases exponentially.
The difference in the search process between MRSE-HCI and MRSE is the retrieval algorithm used in this phase. In the Search(T_w, I, k_top) phase of MRSE, the cloud server needs to compute the relevance score between the encrypted query vector T_w and all encrypted document vectors in I_d, and obtain the top-k ranked document list F_w. The number of operations needed in the Search(T_w, I, k_top) phase is given by Equation 6, where m represents the number of documents in DC and n represents the number of keywords in the dictionary.

O(MRSE) = 2m(2n + 2u + 1) + m − 1    (6)
However, in the Search(T_w, I, k_top) phase of MRSE-HCI, the cloud server uses the information in DC to quickly locate the matched cluster and compares T_w with only a limited number of encrypted document vectors in I_d. The number of operations needed in the Search(T_w, I, k_top) phase is given by Equation 7, where k_i represents the number of cluster centers to be compared at the ith level, and c represents the number of document vectors in the matched cluster.

O(MRSE-HCI) = (Σ_{i=1}^{l} k_i)·2(2n + 2u + 1) + c·2(2n + 2u + 1) + c − 1    (7)

When DC increases exponentially, m can be set to 2^l. The time complexity of the traditional MRSE is then O(2^l), while the time complexity of the proposed MRSE-HCI is only O(l).
The total search time is given by Equation 8 below, where O(trapdoor) is O(1) and O(query) depends on DC.

O(searchTime) = O(trapdoor) + O(query)    (8)

In short, when the number of documents in DC grows exponentially, the search time of MRSE-HCI increases linearly while that of the traditional methods increases exponentially.
6.2 Security Analysis
To keep the security analysis brief, we adopt some concepts from [38-40] and define what kinds of information are leaked to the honest-but-curious server.
The basic information about documents and queries is inevitably leaked to the honest-but-curious server, since all the data are stored at the server and the queries are submitted to it. Moreover, the access pattern and search pattern cannot be concealed in MRSE-HCI, as in previous searchable encryption schemes [19], [39-41].
Definition 1 (Size Pattern). Let D be a document collection. The size pattern induced by a q-query is a tuple a(D, Q) = (m, |Q_1|, ..., |Q_q|), where m is the number of documents and |Q_i| is the size of query Q_i.
Definition 2 (Access Pattern). Let D be a document collection and I be an index over D. The access pattern induced by a q-query is a tuple b(D, Q) = (I(Q_1), ..., I(Q_q)), where I(Q_i) is the set of identifiers returned by query Q_i, for 1 ≤ i ≤ q.
Definition 3 (Search Pattern). Let D be a document collection. The search pattern induced by a q-query is an m×q binary matrix c(D, Q) such that, for 1 ≤ i ≤ m and 1 ≤ j ≤ q, the element in the ith row and jth column is 1 if the document identifier id_i is returned by query Q_j.
Definition 4 (Known Ciphertext Model Secure). Let Π = (Keygen, Index, Enc, Trapdoor, Search, Dec) be an index-based MRSE-HCI scheme over dictionary D_w, and let n ∈ N be the security parameter. The known ciphertext model security experiment PrivK^{kcm}_{A,Π}(n) is described as follows.
1) The adversary submits two document collections D_0 and D_1 of the same length to a challenger.
2) The challenger generates a secret key {sk, k} by running Keygen(1^l(n)).
3) The challenger randomly chooses a bit b ∈ {0, 1} and returns Index(D_b, sk) → I_b and Enc(D_b, k) → E_b to the adversary.
4) The adversary outputs a bit b'.
5) The output of the experiment is defined to be 1 if b' = b, and 0 otherwise.
We say the MRSE-HCI scheme is secure under the known ciphertext model if, for all probabilistic polynomial-time adversaries A, there exists a negligible function negl(n) such that

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl(n)    (9)
Proof. The adversary A distinguishes the document collections by analyzing the secret key, the index, and the encrypted document collection. We then have Equation 10, where Adv(A_D({sk, k})) is the advantage of adversary A in distinguishing the secret key from two random matrices and two random strings, Adv(A_D(I)) is the advantage in distinguishing the index from a random string, and Adv(A_D(E)) is the advantage in distinguishing the encrypted documents from random strings.

Pr(PrivK^{kcm}_{A,Π}(n) = 1) = 1/2 + Adv(A_D(sk, k)) + Adv(A_D(I)) + Adv(A_D(E))    (10)
The elements of the two matrices in the secret key are randomly chosen from {0, 1}^l(n), and the split indicator S and the key k are also chosen uniformly at random from {0, 1}^l(n). Given {0, 1}^l(n), A distinguishes the secret key from two random matrices and two random strings with negligible probability. Then there exists a negligible function negl_1(n) such that
Adv(A_D(sk, k)) = |Pr(Keygen(1^l(n)) → (sk, k)) − Pr(Random → (sk_r, k_r))| ≤ negl_1(n)    (11)
where sk_r denotes two random matrices and a random string, and k_r is a random string. In our scheme, encrypting the hierarchical index essentially means encrypting all the document vectors and cluster center vectors. All the cluster center vectors are treated as document vectors in the encryption phase. Eventually, all the document vectors and cluster center vectors are encrypted by the secure kNN scheme. As secure kNN is known-plaintext-attack (KPA) secure [33], the hierarchical index is secure under the known ciphertext model. Then there exists a negligible function negl_2(n) satisfying

Adv(A_D(I)) = |Pr(Index(D, sk) → I) − Pr(Random → I_r)| ≤ negl_2(n)    (12)
where I_r is a random string.
Since the encryption algorithm used to encrypt D_b is semantically secure, the encrypted documents are secure under the known ciphertext model. Then there exists a negligible function negl_3(n) such that

Adv(A_D(E)) = |Pr(Enc(D, k) → E) − Pr(Random → E_r)| ≤ negl_3(n)    (13)

where E_r is a random string set.
According to Equations 10, 11, 12, and 13, we obtain Equation 14.

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl_1(n) + negl_2(n) + negl_3(n)    (14)

negl(n) = negl_1(n) + negl_2(n) + negl_3(n)    (15)

Pr(PrivK^{kcm}_{A,Π} = 1) ≤ 1/2 + negl(n)    (16)

By combining Equations 14 and 15, we conclude Equation 16. We therefore say that MRSE-HCI is secure under the known ciphertext model.
7 EVALUATION METHOD
7.1 Search Precision
Search precision quantifies user satisfaction. Retrieval precision is related to two factors: the relevance between the documents and the query, and the relevance of the documents to each other. Equation 17 defines the relevance between the retrieved documents and the query.

P_q = Σ_{i=1}^{k'} S(q_w, d_i) / Σ_{i=1}^{k} S(q_w, d_i)    (17)
Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, q_w represents the query vector, d_i represents a document vector, and S is a function that computes the relevance score between q_w and d_i. Equation 18 defines the relevance among the retrieved documents.

P_d = Σ_{j=1}^{k'} Σ_{i=1}^{k'} S(d_j, d_i) / Σ_{j=1}^{k} Σ_{i=1}^{k} S(d_j, d_i)    (18)

Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, and d_i and d_j both denote document vectors.
Equation 19 combines the relevance between the query and the retrieved documents with the relevance among the documents to quantify the search precision:

Acc = a·P_q + P_d    (19)

where a acts as a tradeoff parameter balancing the relevance between the query and the documents against the relevance among the documents. If a is bigger than 1, more emphasis is put on the relevance between the query and the documents; otherwise, more emphasis is put on the relevance among the documents.
The above evaluation strategies should be based on the same dataset and keywords.
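Equation 17 can be computed directly once the two result lists are available; P_d from Equation 18 is analogous with a double sum over document pairs. The query and document vectors below are toy values:

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))

def precision_q(query, retrieved, baseline):
    # Equation 17: score mass of the evaluated method's results relative
    # to the plaintext-search baseline for the same query
    return (sum(relevance(query, d) for d in retrieved) /
            sum(relevance(query, d) for d in baseline))

query = [1, 1, 0]
baseline = [[1, 1, 0], [1, 0, 0]]    # docs a plaintext search would return
retrieved = [[1, 1, 0], [0, 1, 0]]   # docs the evaluated scheme returns
assert precision_q(query, retrieved, baseline) == 1.0
```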
7.2 Rank Privacy
Rank privacy quantifies the information leakage of the search results; the definition is adopted from [19]. Equation 20 is used to evaluate the rank privacy.

P_k = Σ_{i=1}^{k} p_i / k    (20)

Here, k denotes the number of top-k retrieved documents, and p_i = |c'_i − c_i|, where c'_i is the rank of document d_i in the retrieved top-k list, c_i is the actual rank of document d_i in the data set, and p_i is set to k if it is greater than k. The overall rank privacy measure at point k, denoted P_k, is defined as the average value of p_i over every document d_i in the retrieved top-k documents.
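Equation 20 reduces to an average of capped rank shifts. A sketch with invented toy ranks:

```python
def rank_privacy(retrieved_ranks, actual_ranks, k):
    # Equation 20: average |c'_i - c_i| over the top-k results,
    # with each per-document shift capped at k
    shifts = [min(abs(r - a), k) for r, a in zip(retrieved_ranks, actual_ranks)]
    return sum(shifts) / k

# toy top-3 list: the doc shown at position 1 actually ranks 4th, etc.
assert rank_privacy([1, 2, 3], [4, 2, 9], k=3) == 2.0
```

A higher P_k means the returned ordering reveals less about the true ranking, which is why the clustering-induced reordering of MRSE-HCI improves this measure.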
8 PERFORMANCE ANALYSIS
In order to test the performance of MRSE-HCI on a real dataset, we built an experimental platform to test the search efficiency, accuracy, and rank privacy. We implemented the experiment on a distributed platform consisting of three ThinkServer RD830 machines and a ThinkCenter M8400t. The dataset is built from IEEE Xplore and includes about 51,000 documents and 22,000 keywords.
According to the notation defined in Section IV, n denotes the dictionary size, k denotes the number of top-k documents, m denotes the number of documents in the data set, and w denotes the number of keywords in the user's query.
Fig. 11 Search efficiency: (a) search time with an increasing number of documents; (b) search time with an increasing number of retrieved documents; (c) search time with an increasing number of query keywords
Fig. 11 describes the search efficiency under different conditions. Fig. 11 (a) shows the search efficiency for different document set sizes with the dictionary size, number of retrieved documents, and number of query keywords unchanged (n = 22157, k = 20, w = 5). In Fig. 11 (b), we adjust the value of k with the dictionary size, document set size, and number of query keywords unchanged (n = 22157, m = 51312, w = 5). Fig. 11 (c) tests different numbers of query keywords with the dictionary size, document set size, and number of retrieved documents unchanged (n = 22157, m = 51312, k = 20).
From Fig. 11 (a), we can observe that with exponential growth of the document set size, the search time of MRSE increases exponentially, while the search time of MRSE-HCI increases linearly. As Fig. 11 (b) and (c) show, the search time of MRSE-HCI stays stable as the number of query keywords and retrieved documents increases, and it remains far below that of MRSE.
Fig. 12 describes the search accuracy, using plaintext search as the baseline. Fig. 12 (a) illustrates the relevance of the retrieved documents. As the number of documents increases from 3200 to 51200, the MRSE-to-plaintext-search ratio fluctuates around 1, while the MRSE-HCI-to-plaintext-search ratio increases from 1.5 to 2. From Fig. 12 (a), we can observe that the relevance of the retrieved documents in MRSE-HCI is almost twice that in MRSE, which means the documents retrieved by MRSE-HCI are much closer to each other.
Fig. 12 (b) shows the relevance between the query and the retrieved documents. As the document set size increases from 3200 to 51200, the MRSE-to-plaintext-search ratio fluctuates around 0.75, while the MRSE-HCI-to-plaintext-search ratio increases from 0.65 to 0.75 with the growth of the document set size.
Fig. 12 Search precision: (a) relevance of documents; (b) relevance between documents and query; (c) overall evaluation
Fig. 13 Rank privacy
From Fig. 12 (b), we can see that the relevance between the query and the retrieved documents in MRSE-HCI is slightly lower than in MRSE. This gap narrows as the data size increases, since a big document set has a clear category distribution, which improves the relevance between the query and the documents. Fig. 12 (c) shows the rank accuracy according to Equation 19. The tradeoff parameter a is set to 1, meaning there is no bias toward either the relevance of documents or the relevance between documents and query. From the results, we conclude that MRSE-HCI is better than MRSE in rank accuracy.
Fig. 13 describes the rank privacy according to Equation 20. In this test, regardless of the number of retrieved documents, MRSE-HCI has better rank privacy than MRSE. This is mainly caused by the relevance of documents being introduced into the search strategy.
9 CONCLUSION
In this paper, we investigated ciphertext search in the cloud storage scenario. We explored the problem of maintaining the semantic relationship between plain documents over the related encrypted documents, and gave a design method to enhance the performance of semantic search. We also proposed the MRSE-HCI architecture to meet the requirements of data explosion, online information retrieval, and semantic search. At the same time, a verifiable mechanism was also proposed to guarantee
the correctness and completeness of the search results. In addition, we analyzed the search efficiency and security under two popular threat models. An experimental platform was built to evaluate search efficiency, accuracy, and rank security. The experimental results show that the proposed architecture not only properly solves the multi-keyword ranked search problem, but also brings an improvement in search efficiency, rank security, and the relevance between retrieved documents.
10 ACKNOWLEDGEMENT
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06040602) and the Xinjiang Uygur Autonomous Region science and technology plan (No. 201230121).
REFERENCES
[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, ”Security
analysis of authentication protocols for next-generation mobile
and CE cloud services,” in Proc. ICCE, Berlin, Germany, 2011,
pp. 83-87.
[2] D. X. D. Song, D. Wagner, and A. Perrig, ”Practical techniques
for searches on encrypted data,” in Proc. S &P, BERKELEY,
CA, 2000, pp. 44-55.
[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano,
”Public key encryption with keyword search,” in Proc. EURO-
CRYPT, Interlaken, SWITZERLAND, 2004, pp. 506-522.
[4] Y. C. Chang, and M. Mitzenmacher, ”Privacy preserving key-
word searches on remote encrypted data,” in Proc. ACNS,
Columbia Univ, New York, NY, 2005, pp. 442-455.
[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, ”Search-
able symmetric encryption: improved definitions and efficient
constructions,” in Proc. ACM CCS, Alexandria, Virginia, USA,
2006, pp. 79-88.
[6] M. Bellare, A. Boldyreva, and A. O’Neill, ”Deterministic and
efficiently searchable encryption,” in Proc. CRYPTO, Santa Bar-
bara, CA, 2007, pp. 535-552.
[7] D. Boneh, and B. Waters, ”Conjunctive, subset, and range
queries on encrypted data,” in Proc. TCC, Amsterdam,
NETHERLANDS, 2007, pp. 535-554.
[8] D. X. D. Song, D. Wagner, and A. Perrig, ”Practical techniques
for searches on encrypted data,” in Proc. S &P 2000, BERKE-
LEY, CA, 2000, pp. 44-55.
[9] E.-J. Goh, Secure Indexes, IACR Cryptology ePrint Archive, vol.
2003, pp. 216. 2003.
[10] C. Wang, N. Cao, K. Ren, and W. J. Lou, Enabling Secure and
Efficient Ranked Keyword Search over Outsourced Cloud Data,
IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467-1479,
Aug. 2012.
[11] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He, M.
Wu, and D. Oard, ”Confidentiality-Preserving Rank-Ordered
Search,” in Proc. ACM StorageSS, Alexandria, VA, 2007, pp.
7-12.
[12] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, ”Zerber+R:
top-k retrieval from a confidential index,” in Proc. EDBT, Saint
Petersburg, Russia, 2009, pp. 439-449.
[13] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, ”Secure Ranked
Keyword Search over Encrypted Cloud Data,” in Proc. ICDCS,
Genova, ITALY, 2010.
[14] P. Golle, J. Staddon, and B. Waters, ”Secure conjunctive key-
word search over encrypted data,” in Proc. ACNS, Yellow Mt,
China, 2004, pp. 31-45.
[15] L. Ballard, S. Kamara, and F. Monrose, ”Achieving efficient
conjunctive keyword searches over encrypted data,” in Proc.
ICICS, Beijing, China, 2005, pp. 414-426.
[16] R. Brinkman, Searching in Encrypted Data, Ph.D. dissertation, University of Twente, 2007.
[17] Y. H. Hwang, and P. J. Lee, ”Public key encryption with
conjunctive keyword search and its extension to a multi-user
system,” in Proc. Pairing, Tokyo, JAPAN, 2007, pp. 2-22.
[18] H. Pang, J. Shen, and R. Krishnan, Privacy-Preserving
Similarity-Based Text Retrieval, ACM Trans. Internet. Technol.,
vol. 10, no. 1, pp. 39, Feb. 2010.
[19] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, ”Privacy-
Preserving Multi-keyword Ranked Search over Encrypted
Cloud Data,” in Proc. IEEE INFOCOM, Shanghai, China, 2011,
pp. 829-837.
[20] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and
H. Li, ”Privacy-preserving multi-keyword text search in the
cloud supporting similarity-based ranking,” in Proc. ASIACCS,
Hangzhou, China, 2013, pp. 71-82.
[21] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, ”Dynamic
authenticated index structures for outsourced databases,” in
Proc. ACM SIGMOD, Chicago, IL, USA, 2006, pp. 121-132.
[22] H. H. Pang, and K. L. Tan, ”Authenticating query results in
edge computing,” in Proc. ICDE, Boston, MA, 2004, pp. 560-571.
[23] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong,
and S. G. Stubblebine, A general model for authenticated data
structures, Algorithmica, vol. 39, no. 1, pp. 21-41, May. 2004.
[24] R. C. Merkle, ”Protocols for public key cryptosystems,” in Proc. S&P, Oakland, CA, 1980, pp. 122-134.
[25] R. C. Merkle, ”A certified digital signature,” Lect. Notes Comput. Sci., vol. 435, pp. 218-238, 1990.
[26] M. Naor, and K. Nissim, Certificate revocation and certificate
update, IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561-570,
Apr. 2000.
[27] H. Pang, and K. Mouratidis, Authenticating the query results
of text search engines, Proc. VLDB Endow., vol. 1, no. 1, pp.
126-137, Aug. 2008.
[28] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, ”A Hierarchical
Clustering Method For Big Data Oriented Ciphertext Search,”
presented at Proc. BigSecurity, Toronto, Canada, Apr. 27-May.
2, 2014.
[29] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, ”Achieving Secure,
Scalable, and Fine-grained Data Access Control in Cloud Com-
puting,” in Proc. IEEE INFOCOM, San Diego, CA, 2010, pp.
1-9.
[30] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes:
compressing and indexing documents and images, 2nd ed., San
Francisco: Morgan Kaufmann, 1999.
[31] J. MacQueen, ”Some methods for classification and analysis of
multivariate observations,” in Proc. Berkeley Symp. Math. Stat.
Prob, California, USA, 1967, p. 14.
[32] Z. X. Huang, Extensions to the k-means algorithm for cluster-
ing large data sets with categorical values, Data Min. Knowl.
Discov., vol. 2, no. 3, pp. 283-304, Sep. 1998.
[33] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, ”Secure
kNN Computation on Encrypted Databases,” in Proc. ACM
SIGMOD, Providence, RI, 2009, pp. 139-152.
[34] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu,
Efficient multi-keyword ranked query over encrypted data in
cloud computing, Futur. Gener. Comp. Syst., vol. 30, pp. 179-
190, Jan. 2014.
[35] C. Gentry, ”Fully homomorphic encryption using ideal lattices,” in Proc. STOC, Bethesda, MD, 2009, pp. 169-178.
[36] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, ”Public key encryption with keyword search,” in Proc. EUROCRYPT, Interlaken, Switzerland, 2004, pp. 506-522.
[37] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Highly-scalable searchable symmetric encryption with support for Boolean queries,” in Proc. CRYPTO, Santa Barbara, CA, 2013, pp. 353-373.
[38] S. Kamara, C. Papamanthou, and T. Roeder, ”Dynamic searchable symmetric encryption,” in Proc. ACM CCS, Raleigh, NC, 2012, pp. 965-976.
[39] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, ”Searchable symmetric encryption: improved definitions and efficient constructions,” in Proc. ACM CCS, Alexandria, VA, 2006, pp. 79-88.
[40] M. Chase and S. Kamara, ”Structured encryption and controlled disclosure,” in Proc. ASIACRYPT, Singapore, 2010, pp. 577-594.
[41] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Dynamic searchable encryption in very large databases: data structures and implementation,” in Proc. NDSS, San Diego, CA, 2014.
[42] S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, ”Outsourced symmetric private information retrieval,” in Proc. ACM CCS, Berlin, Germany, 2013, pp. 875-888.
Chi Chen (M'14) received the B.S. (2000) and M.S. (2003) degrees from Shandong University, Jinan, China, and the Ph.D. degree (2008) from the Institute of Software, Chinese Academy of Sciences, Beijing, China. He is an associate research fellow at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include cloud security and database security. From 2003 to 2011, he was a research apprentice, research assistant, and associate research fellow with the State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences. Since 2012, he has been an associate research fellow with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.
Xiaojie Zhu received the B.S. degree from Zhejiang University of Technology, Hangzhou, China, in 2011. He is currently pursuing the M.S. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.
Peisong Shen received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2012. He is currently pursuing the Ph.D. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.
Jiankun Hu is a Professor and Research Director of the Cyber Security Lab, The University of New South Wales, Canberra, Australia. He has obtained seven ARC (Australian Research Council) grants and is now serving on the prestigious Panel of Mathematics, Information and Computing Sciences of the ARC ERA Evaluation Committee.
Song Guo received his Ph.D. in computer science from the University of Ottawa, Canada. From 2001 to 2006, he worked as chief software architect for Liska Biometry Inc., NH, USA. Dr. Guo also held a position with the Department of Electrical and Computer Engineering, the University of British Columbia, on a prestigious NSERC (Natural Sciences and Engineering Research Council of Canada) Postdoctoral Fellowship in 2006. From 2006 to 2007, he was an Assistant Professor at the Department of Computer Science, University of Northern British Columbia, Canada. He is currently a Full Professor with the School of Computer Science and Engineering, the University of Aizu, Japan.
Zahir Tari received the degree in mathematics from the University of Science and Technology Houari Boumediene, Bab-Ezzouar, Algeria, in 1984, the Masters degree in operational research from the University of Grenoble, Grenoble, France, in 1985, and the PhD degree in computer science from the University of Grenoble, in 1989. He is a Professor (in distributed systems) at RMIT University, Melbourne, Australia. Earlier, he joined the Database Laboratory at EPFL (Swiss Federal Institute of Technology, 1990-1992) and then moved to QUT (Queensland University of Technology, 1993-1995) and RMIT (Royal Melbourne Institute of Technology, since 1996). He is the Head of DSN (Distributed Systems and Networking) at the School of Computer Science and IT, where he pursues high-impact research and development in computer science. He leads a few research groups that focus on some of the core areas, including networking (QoS routing, TCP/IP congestion), distributed systems (performance, security, mobility, reliability), and distributed applications (SCADA, Web/Internet applications, mobile applications). His recent research interests are in performance (in Cloud) and security (in SCADA systems). Dr. Tari regularly publishes in prestigious journals (such as IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Web Services, and ACM Transactions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.). He co-authored two books (John Wiley) and edited more than 10 books. He has been the Program Committee Chair of several international conferences, including DOA (Distributed Object and Application Symposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on Data Semantics. He has also been the General Chair of more than 12 conferences. He is the recipient of 14 ARC (Australian Research Council) grants. He is a senior member of the IEEE.
Albert Y. Zomaya is currently the Chair Professor of High Performance Computing and Networking and an Australian Research Council Professorial Fellow in the School of Information Technologies, The University of Sydney, Sydney, Australia. He is also the Director of the Centre for Distributed and High Performance Computing, which was established in late 2009. He is the author/co-author of seven books and more than 370 papers, and the editor of nine books and 11 conference proceedings. Prof. Zomaya is the Editor-in-Chief of the IEEE Transactions on Computers and serves as an Associate Editor for 19 leading journals. He is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He is a Chartered Engineer (CEng), a Fellow of the AAAS, the IEEE, and the IET (UK), and a Distinguished Engineer of the ACM.
... The completeness of the search means that the retrieved data has not been tampered with. In addition, Chen et al. [9] proposed an authenticated Merkle hash tree to verify the search result. Although significant progress has been made by the existing constructions [8] [9], the verifiable property comes at the high cost of extra storage and computation. ...
... Finally, with the availability of GPUs and TPUs, the requirement of parallelism is essential. Although efficient SSE constructions are available [6], [9], existing solutions are still highly sequential. ...
... Thereafter, Curtmola et al. [7] formalized the security definition of SSE and proposed two constructions that corresponded to nonadaptive semantic security and adaptive semantic security by assuming the the existence of a pseudo-random permutation and an encryption algorithm that provides security against chosen plaintext attacks. Following these definitions, various SSE schemes have been proposed to enrich queries and enhance search efficiency, such as ranked keyword search [18] [9], fuzzy keyword search [19], similarity search [20], semantic search [21], and parallel search [22]. ...
Article
Full-text available
Cloud service models intrinsically cater to multiple tenants. In current multi-tenancy model, cloud service providers isolate data within a single tenant boundary with no or minimum cross-tenant interaction. With the booming of cloud applications, allowing a user to search across tenants is crucial to utilize stored data more effectively. However, conducting such a search operation is inherently risky, primarily due to privacy concerns. Moreover, existing schemes typically focus on a single tenant and are not well suited to extend support to a multi-tenancy cloud, where each tenant operates independently. In this article, to address the above issue, we provide a privacy-preserving, verifiable, accountable, and parallelizable solution for “privacy-preserving keyword search problem" among multiple independent data owners. We consider a scenario in which each tenant is a data owner and a user’s goal is to efficiently search for granted documents that contain the target keyword among all the data owners. We first propose a verifiable yet accountable keyword searchable encryption (VAKSE) scheme through symmetric bilinear mapping. For verifiability, a message authentication code (MAC) is computed for each associated piece of data. To maintain a consistent size of MAC, the computed MACs undergo an exclusive OR operation. For accountability, we propose a keyword-based accountable token mechanism where the client’s identity is seamlessly embedded without compromising privacy. Furthermore, we introduce the parallel VAKSE scheme, in which the inverted index is partitioned into small segments and all of them can be processed synchronously. We also conduct formal security analysis and comprehensive experiments to demonstrate the data privacy preservation and efficiency of the proposed schemes, respectively.
... Description of data is done only where attributes match file attributes. [15,19] Cipher text policy attributes-based encryption provides one to several flexible access controls. In this strategy each document is encoded independently and their encryption effectiveness expanded by hierarchical property based encryption scheme. ...
... The collection of documents can be encrypted and generate an included access tree and rather than each and every encryption. In [14,15] both cipher text storage and time cost of encryption or decryption are stored. The proposed approach is demonstrated hypothetically and its proficiency for effective looking of the encrypted documents and make a record structure for the collection of documents. ...
... Guo etal [15] proposed a leakage hierarchical attribute-based encryption technique to protect next to the input data outflow attacks .Cao etal [17] propose a multi-key ranked search scheme using secure K-Nearest Neighbour algorithm. The collection of privacy requirements are recognized and there are two schemes are proposed to increase searching efficiency and security .Li etal [5] proposed a new ABE scheme which can execute keyword search function. ...
... Chi Chen and Xiaojie Zhu [7] used a hierarchical clustering method to maintain the close relationship between plain documents and encrypted documents to increase search efficiency within a big data environment. They also used a coordinate matching technique [8] to measure the relevance score between query and document. ...
... From chart 1, we can see that the time needed to search the documents increases when the size of dataset increases. Compared with the previous related work[7] time needed to search the documents is less. ...
Article
Full-text available
Cloud computing provides the facility to store and manage data remotely. The volume of information is increasing per day. The owners choose to store the sensitive data on the cloud storage. To protect the data from unauthorized accesses, the data must be uploaded in encrypted form. Due to large amount of information is stored on the cloud storage; the association between the documents is hiding when the documents are encrypted. It is necessary to design a search technique which gives the results on the basis of the similarity values of the encrypted documents. In this paper a cosine similarity clustering method is proposed to make the clusters of similar documents based on the cosine values of the document vectors. We also proposed a MRSE-CSI model used to search the documents which are in encrypted form. The proposed search technique only finds the cluster of documents with the highest similarity value instead of searching on the whole dataset. Processing the dataset on two algorithms shows that the time needed to form the clusters in the proposed method is less. When the documents in the dataset increases, the time needed to form clusters also increases. The result of the search shows that increasing the documents also increases the search time of the proposed method.
... [47]. Similar to the models in [48], FCC calls for better data privacy, despite good trustworthiness. ...
Article
Full-text available
To solve the security problems of the moving robot system in the fog network of the Industrial Internet of Things (IIoT), this paper presents a privacy‐preserving data integration scheme in the moving robot system. First, a novel data collection enhancement algorithm is proposed to enhance the image effects, and a k‐anonymous location and data privacy protection protocol based on Ad hoc network (Ad hoc‐based KLDPP protocol) is designed in secure data collection phase to protect the privacy of location and network data. Second, the secure multiparty computation with verifiable key sharing is introduced to realize the valid computation against share cheating in the robot system. Third, the ciphertext classification method in a neural network is considered in the secure data storage process to realize the special application. Finally, experiments and simulations are conducted on the robot system of fog computing in the IIoT. The results demonstrate that the proposed scheme can improve the security and efficiency of the said robot system.
... ES 3 DBMS attains enhanced security by utilizing the learning with errors (LWE)-based secure kNN algorithm to encrypt features indices [31]. This approach guarantees strong privacy protection for the underlying feature vectors. ...
Article
Full-text available
Deep learning-based semantic search (DLSS) aims to bridge the gap between experts and non-experts in search. Experts can create precise queries due to their prior knowledge, while non-experts struggle with specific terms and concepts, making their queries less precise. Cloud infrastructure offers a practical and scalable platform for data owners to upload their data, making it accessible to intended data users. However, the contemporary single-owner/single-user (S/S) approach to DLSS schemes falls short of effectively leveraging the inherent multi-user capabilities of cloud infrastructure. Furthermore, most of these schemes delegate the dissemination of secret keys to a single trust point within the mutual distrust scenario in cloud infrastructure. This paper proposes a Secure Semantic Search using Deep Learning in a Blockchain-Assisted Multi-User Setting (S3DBMS)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S^3DBMS)$$\end{document}. Specifically, the seamless integration of attribute-based encryption with transfer learning allows the construction of DLSS in multi-owner/multi-user (M/M) settings. Further, blockchain’s smart contract mechanism allows a multi-attribute authority consensus-based generation of user private keys and system-wide global parameters in a mutual distrust M/M scenario. Finally, our scheme achieves privacy requirements and offers improved security and accuracy.
Article
This research aims to enhance photo encryption security by developing a sophisticated technique. This method uses homomorphic encryption to address challenges in encrypting visible spectrum pictures. Each red–green–blue (RGB) channel of the image is divided into smaller sub-values, encrypted separately using an optimized homomorphic encryption algorithm, and then combined for further encryption. Additionally, a novel approach involves combining surrounding pixels to embed extra data during encryption. The process allows for compression and decompression of encrypted components for easier storage or transmission. After decryption, the initial pixel values are recovered, removing any unnecessary data and condensing each channel's pixel intensity into just two sub-values. Multiple security evaluations confirm the method's robustness and resistance, emphasizing its strong security features for encrypted images.
Article
Recently, the Convolutional Neural Network (CNN) based Content-Based Image Retrieval (CBIR) has substantially improved the search accuracy of encrypted images. Further, the increasing trends in outsourcing the CNN-based CBIR service to the cloud relieve the users from severe computation and storage requirements. However, all of the existing CNN-based CBIR schemes lack the support for Multi-owner multi-user settings and thus significantly limit the flexibility and scalability of cloud computing. To fill this gap, we propose a V erifiable P rivacy-preserving I mage R etrieval scheme in the M ulti-owner multi-user setting (VPIRM). VPIRM utilizes a two-phase transfer learning technique. In the first phase, convolution base transfer takes the pre-trained CNN model for feature extraction, which addresses the issue of scarce training data at the image owner (IO) side. In the second phase, novel secure transfer enables the image user (IU) to construct a query feature vector over the same feature space on which the model is trained. Meanwhile, our scheme simultaneously supports fine-grained access control, dynamic updates, and results correctness and completeness on a malicious cloud server. Finally, a thorough security analysis shows that the scheme achieves various privacy requirements under the known-ciphertext and known-background threat model.
Article
Outsourcing data to the cloud has become prevalent, so Searchable Symmetric Encryption (SSE), one of the methods for protecting outsourced data, has arisen widespread interest. Moreover, many novel technologies and theories have emerged, especially for the attacks on SSE and privacy-preserving. But most surveys related to SSE concentrate on one aspect (e.g., single keyword search, fuzzy keyword search, etc.) or lack in-depth analysis. Therefore, we revisit the existing work and conduct a comprehensive analysis and summary. We provide an overview of state of the art in SSE and focus on the privacy it can protect. Generally, (1) we study the work of the past few decades and classify SSE based on query expressiveness. Meanwhile, we summarize the existing schemes and analyze their performance on efficiency, storage space, index structures, etc.; (2) we complement the gap in the privacy of SSE and introduce in detail the attacks and the related defenses; (3) we discuss the open issues and challenges in existing schemes and future research directions. We desire that our work will help novices to grasp and understand SSE comprehensively. We expect it can inspire the SSE community to discover more crucial leakages and design more efficient and secure constructions.
Conference Paper
Full-text available
With the increasing popularity of cloud computing, huge amount of documents are outsourced to the cloud for reduced management cost and ease of access. Although en-cryption helps protecting user data confidentiality, it leaves the well-functioning yet practically-efficient secure search functions over encrypted data a challenging problem. In this paper, we present a privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, we propose to build the search index based on term frequency and the vector space model with cosine similarity measure to achieve higher search result accuracy. To improve the search efficiency, we propose a tree-based index structure and various adaption methods for multi-dimensional (MD) algorithm so that the practical search efficiency is much better than that of linear search. To further enhance the search privacy, we propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., known ciphertext model and known background model. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
Conference Paper
Full-text available
Following the wide use of cloud services, the volume of data stored in the data center has experienced a dramatically growth which makes real-time information retrieval much more difficult than before. Furthermore, text information is usually encrypted before being outsourced to data centers in order to protect users' data privacy. Current techniques to search on encrypted data do not perform well within such a massive data environment. In this paper, a hierarchical clustering method for ciphertext search within a big data environment is proposed. The proposed approach clusters the documents based on the minimum similarity threshold, and then partitions the resultant clusters into sub-clusters until the constraint on the maximum size of cluster is reached. In the search phase, this approach can reach a linear computational complexity against exponential size of document collection. In addition, retrieved documents have a better relationship with each other than traditional methods. An experiment has been conducted using the collection set built from the recent ten years' IEEE INFOCOM publications, including about 3000 documents with nearly 5300 keywords. The results have validated our proposed approach.
Article
Full-text available
This work presents the design and analysis of the first searchable symmetric encryption (SSE) protocol that supports conjunctive search and general Boolean queries on outsourced symmetrically- encrypted data and that scales to very large databases and arbitrarily-structured data including free text search. To date, work in this area has focused mainly on single-keyword search. For the case of conjunctive search, prior SSE constructions required work linear in the total number of documents in the database and provided good privacy only for structured attribute-value data, rendering these solutions too slow and inflexible for large practical databases. In contrast, our solution provides a realistic and practical trade-off between performance and privacy by efficiently supporting very large databases at the cost of moderate and well-defined leakage to the outsourced server (leakage is in the form of data access patterns, never as direct exposure of plaintext data or searched values). We present a detailed formal cryptographic analysis of the privacy and security of our protocols and establish precise upper bounds on the allowed leakage. To demonstrate the real-world practicality of our approach, we provide performance results of a prototype applied to several large representative data sets, including encrypted search over the whole English Wikipedia (and beyond).
Conference Paper
Full-text available
In the setting of searchable symmetric encryption (SSE), a data owner D outsources a database (or document/file collection) to a remote server E in encrypted form such that D can later search the collection at E while hiding information about the database and queries from E. Leakage to E is to be confined to well-defined forms of data-access and query patterns while preventing disclosure of explicit data and query plaintext values. Recently, Cash et al. presented a protocol, OXT, which can run arbitrary boolean queries in the SSE setting and which is remarkably efficient even for very large databases. In this paper we investigate a richer setting in which the data owner D outsources its data to a server E but D is now interested to allow clients (third parties) to search the database such that clients learn the information D authorizes them to learn but nothing else while E still does not learn about the data or queried values as in the basic SSE setting. Furthermore, motivated by a wide range of applications, we extend this model and requirements to a setting where, similarly to private information retrieval, the client's queried values need to be hidden also from the data owner D even though the latter still needs to authorize the query. Finally, we consider the scenario in which authorization can be enforced by the data owner D without D learning the policy, a setting that arises in court-issued search warrants. We extend the OXT protocol of Cash et al. to support arbitrary boolean queries in all of the above models while withstanding adversarial non-colluding servers (D and E) and arbitrarily malicious clients, and while preserving the remarkable performance of the protocol.
Article
With the growing popularity of cloud computing, huge amounts of documents are outsourced to the cloud for reduced management cost and ease of access. Although encryption helps protect user data confidentiality, it leaves the well-functioning yet practically-efficient secure search functions over encrypted data a challenging problem. In this paper, we present a verifiable privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, we propose to build the search index based on term frequency and the vector space model with cosine similarity measure to achieve higher search result accuracy. To improve the search efficiency, we propose a tree-based index structure and various adaptive methods for multi-dimensional (MD) algorithm so that the practical search efficiency is much better than that of linear search. To further enhance the search privacy, we propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., known ciphertext model and known background model. In addition, we devise a scheme upon the proposed index tree structure to enable authenticity check over the returned search results. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
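The term-frequency vector space model with cosine similarity that the abstract above builds its index on can be illustrated in plaintext. This is only an illustrative sketch of the ranking measure itself, not of the paper's encrypted tree index; the document collection and query here are invented examples.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Raw term-frequency vector for a tokenized document."""
    return Counter(tokens)

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical plaintext collection; an MTS-style scheme would store
# encrypted index vectors instead of raw tokens.
docs = {
    "d1": "cloud storage secure search cloud".split(),
    "d2": "keyword search ranking".split(),
}
query_vec = tf_vector("cloud search".split())
ranked = sorted(docs,
                key=lambda d: cosine_similarity(tf_vector(docs[d]), query_vec),
                reverse=True)
```

Documents sharing more query terms, weighted by how often they occur, score closer to the query vector and rank first.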
Article
We study the setting in which a user stores encrypted documents (e.g. e-mails) on an untrusted server. In order to retrieve documents satisfying a certain search criterion, the user gives the server a capability that allows the server to identify exactly those documents. Work in this area has largely focused on search criteria consisting of a single keyword. If the user is actually interested in documents containing each of several keywords (conjunctive keyword search) the user must either give the server capabilities for each of the keywords individually and rely on an intersection calculation (by either the server or the user) to determine the correct set of documents, or alternatively, the user may store additional information on the server to facilitate such searches. Neither solution is desirable; the former enables the server to learn which documents match each individual keyword of the conjunctive search and the latter results in exponential storage if the user allows for searches on every set of keywords. We define a security model for conjunctive keyword search over encrypted data and present the first schemes for conducting such searches securely. We propose first a scheme for which the communication cost is linear in the number of documents, but that cost can be incurred "offline" before the conjunctive query is asked. The security of this scheme relies on the Decisional Diffie-Hellman (DDH) assumption. We propose a second scheme whose communication cost is on the order of the number of keyword fields and whose security relies on a new hardness assumption.
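The naive intersection approach that the abstract above argues against can be made concrete with a small plaintext sketch. The index and document ids here are hypothetical; the point is that whoever computes the intersection necessarily sees the full match set for each individual keyword, which is exactly the leakage the paper's secure schemes are designed to avoid.

```python
# Naive conjunctive search: look up each keyword's match set separately,
# then intersect. The per-keyword match sets are exposed to the party
# doing the intersection (the leakage criticized in the abstract).

index = {
    "cloud":  {"d1", "d2", "d4"},
    "search": {"d1", "d3", "d4"},
    "secure": {"d1", "d4", "d5"},
}

def conjunctive_match(index, keywords):
    """Return ids of documents containing every queried keyword."""
    sets = [index.get(k, set()) for k in keywords]
    if not sets:
        return set()
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return result
```

A secure conjunctive scheme must return the same final set while hiding the intermediate per-keyword sets.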
Article
Cloud computing infrastructure is a promising new technology and greatly accelerates the development of large scale data storage, processing and distribution. However, security and privacy become major concerns when data owners outsource their private data onto public cloud servers that are not within their trusted management domains. To avoid information leakage, sensitive data have to be encrypted before uploading onto the cloud servers, which makes it a big challenge to support efficient keyword-based queries and rank the matching results on the encrypted data. Most current works only consider single keyword queries without appropriate ranking schemes. In the current multi-keyword ranked search approach, the keyword dictionary is static and cannot be extended easily when the number of keywords increases. Furthermore, it does not take the user behavior and keyword access frequency into account. For the query matching result which contains a large number of documents, the out-of-order ranking problem may occur. This makes it hard for the data consumer to find the subset that is most likely satisfying its requirements. In this paper, we propose a flexible multi-keyword query scheme, called MKQE to address the aforementioned drawbacks. MKQE greatly reduces the maintenance overhead during the keyword dictionary expansion. It takes keyword weights and user access history into consideration when generating the query result. Therefore, the documents that have higher access frequencies and that match closer to the users’ access history get higher rankings in the matching result set. Our experiments show that MKQE presents superior performance over the current solutions.
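The idea of folding keyword access frequency and user history into the ranking, as described in the MKQE abstract above, can be sketched as a weighted score. The blending formula, the `alpha` parameter, and the data are hypothetical illustrations; the paper defines its own scoring over an encrypted index.

```python
def rank_with_history(matches, relevance, access_count, alpha=0.7):
    """Rank matching documents by blending a base relevance score with
    normalized access frequency, so frequently accessed documents rise
    in the result list. `alpha` (hypothetical) trades relevance against
    history; alpha=1.0 ignores history entirely."""
    max_access = max((access_count.get(d, 0) for d in matches), default=0) or 1

    def score(doc_id):
        freq = access_count.get(doc_id, 0) / max_access
        return alpha * relevance[doc_id] + (1 - alpha) * freq

    return sorted(matches, key=score, reverse=True)

# Two equally relevant documents: the frequently accessed one ranks first.
matches = ["d1", "d2"]
relevance = {"d1": 0.5, "d2": 0.5}
access_count = {"d2": 10}
ranked = rank_with_history(matches, relevance, access_count)
```

This also suggests how the out-of-order ranking problem is mitigated: ties and near-ties in pure relevance are broken by observed user behavior.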
Article
Searchable symmetric encryption (SSE) allows a client to encrypt its data in such a way that this data can still be searched. The most immediate application of SSE is to cloud storage, where it enables a client to securely outsource its data to an untrusted cloud provider without sacrificing the ability to search over it. SSE has been the focus of active research and a multitude of schemes that achieve various levels of security and efficiency have been proposed. Any practical SSE scheme, however, should (at a minimum) satisfy the following properties: sublinear search time, security against adaptive chosen-keyword attacks, compact indexes and the ability to add and delete files efficiently. Unfortunately, none of the previously-known SSE constructions achieve all these properties at the same time. This severely limits the practical value of SSE and decreases its chance of deployment in real-world cloud storage systems. To address this, we propose the first SSE scheme to satisfy all the properties outlined above. Our construction extends the inverted index approach (Curtmola et al., CCS 2006) in several non-trivial ways and introduces new techniques for the design of SSE. In addition, we implement our scheme and conduct a performance evaluation, showing that our approach is highly efficient and ready for deployment.
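The inverted index approach (Curtmola et al., CCS 2006) that the abstract above extends is, at its core, a keyword-to-document-list map. The sketch below shows only the plaintext structure with invented documents; an SSE scheme stores an encrypted analogue in which both the keywords and the posting lists are hidden from the server.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Plaintext inverted index: keyword -> sorted list of document ids.
    A dynamic SSE scheme must additionally support adding and deleting
    files, i.e. updating these posting lists under encryption."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for word in sorted(set(text.split())):
            index[word].append(doc_id)
    return index

docs = {
    "d1": "secure cloud search",
    "d2": "cloud storage",
}
idx = build_inverted_index(docs)
```

Search over this structure is a single lookup per keyword, which is what makes sublinear search time achievable once the structure is encrypted.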