Poshida, A Protocol for Private Information Retrieval

Mohib Ullah, Rafiullah Khan
IBMS/CS, The University of Agriculture, Peshawar, Pakistan
mohibullah@aup.edu.pk, rafiyz@aup.edu.pk

Muhammad Arshad Islam
Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan
arshad.islam@cust.edu.pk
Abstract— Web Search Engines (WSEs) are the easiest way to retrieve data from the Internet: a WSE retrieves information from an ocean of data according to user-generated queries. In return, the WSE records those queries to build a user profile and return personalized search results. WSEs sell their query logs to marketing companies to generate revenue, which poses a threat to user privacy. In 2006, AOL released a query log for research purposes, but the log was not properly anonymized, which led to the identification of users. Performing private web search is an active area of research. Many techniques have been proposed to hide the identity of users from the WSE; however, a state-of-the-art scheme to privately retrieve information from a WSE has yet to be devised. This paper proposes a new protocol to obfuscate the user profile maintained by the WSE. Profile Exposure Level is used to measure the level of privacy a user achieves against the WSE. Results show the protocol hides 50% of the user profile at the first degree, 70% at the second, 77% at the third, and more than 85% at the fourth degree.
Keywords: Web Search Engine Privacy, Profile Exposure Level, Profile Obfuscation
I. INTRODUCTION
The Internet is an enormous warehouse of data that holds a wide range of material and contains information about almost anything. People of every category, class, and country need information residing on the WWW. Web Search Engines (WSEs) such as Google, Ask, Bing, AOL, and Baidu let us retrieve relevant information from the web through search queries. During the search and retrieval process, a WSE records all submitted queries in a query log. A typical query-log entry may contain the query content, the machine's IP address, operating-system details, browser type, the query's date and time, browser language, some important preferences, and cookies that can be used to uniquely identify the user's browser [1, 2]. WSEs state that they evaluate the query log with certain algorithms over a long period to profile and categorize users according to their interests. A typical query is about three words long, which reveals very little about the actual interest of the user, and certain words can be ambiguous: for example, the keyword "mouse" may refer to an animal or to a computer device. WSEs use the user's profile to show relevant results [2, 3].
The query log is a precious resource to the WSEs, which often sell it to marketing companies for business purposes [4]. Query logs frequently hold sensitive evidence about people, and the dissemination of such data violates their privacy [5]. On several occasions, user queries have contained important information such as a unique user ID, name, employer details, location, religion, health information, gender orientation, political views, faith, and beliefs, which can be exceptionally sensitive for the owner [6]. The release of such information poses a serious risk to user privacy. Among many risks, a key one is disclosure to third parties (e.g., advertisers or media) [5, 22] for business purposes or for gathering information about competitors' products. The major privacy scandal was the release of the AOL log in 2006, in which twenty million queries generated by 658,000 users over three months were published for research purposes. Before the release, the query log was anonymized by replacing IP addresses with unique IDs (pseudonyms); nevertheless, a user named Thelma Arnold, with ID 4417749, was successfully identified [8, 9] because the log was not appropriately anonymized. In another episode, the US Department of Justice issued a summons [2] to AOL, Yahoo, Google, and Microsoft to provide query logs, as part of litigation over an Internet child-safety law, in order to determine whether Internet filters were effectively protecting children from adult content [2]. These incidents put a question mark over WSEs' policies on user privacy, and many users have demanded that WSEs should not maintain query logs at all [11].
978-1-5090-2000-3/16/$31.00 ©2016 IEEE
A. Related work:
So far, many techniques have been proposed to obfuscate the user profile. These techniques are generally classified into four categories: standalone schemes, third-party infrastructure, query scrambling, and distributed schemes. Standalone schemes such as TrackMeNot and GooPIR protect the privacy of a lone user. TrackMeNot is a Firefox plugin that programmatically creates a query seed file and sends noise queries along with the original query; this hides the user's original query among the noise queries and thus obfuscates his profile. GooPIR masks the original query with k-1 false queries. In both cases, the machine-generated queries can be distinguished from real human queries. Third-party infrastructures such as Scroogle1 and Anonymizer2 are proxy services in which the user forwards his queries to a proxy server, which forwards them to the WSE. The problem stays the same: profiling can now be done at the proxy server. Tor (The Onion Router) is a group of volunteer-operated network servers that provide privacy [10]. Tor provides anonymity at the network layer, but it was not specifically designed for web anonymity, and a WSE can still identify a Tor user at the application layer. Query scrambling is a novel technique [8] proposed to obfuscate the user profile. It never tries to hide the identity of the user; instead, the user's query is scrambled so that it only loosely corresponds to the user's genuine interest, thus distorting the user profile. The real query is divided into multiple scrambled queries, which are submitted to the WSE; the results are collected, and a scrambled ranking is applied to the result list to recover the user's actual interest, a step called descrambling [8]. Distributed schemes work through the cooperation of multiple users to diffuse the identities of the users in the group. Each user forwards someone else's query and in return has his own query forwarded by another user, thus obfuscating the profile with real human queries.
Private Information Retrieval (PIR) [12] was the first technique proposed to retrieve an element from a database without letting the database learn which item the user is interested in. Crowds [13] was the first distributed scheme introduced to attain anonymity. Crowds consisted of client-side software called a jondo and server-side software called a blender. To make a web transaction, a user first joins the crowd; a biased coin is then flipped to determine whether to send the request directly to the WSE or to forward it to another member of the crowd. The reply from the WSE comes back through the same path. Crowds achieved some degree of anonymity against the WSE but offered no privacy against local eavesdropping, and both the query and the answer remain visible to every member of the crowd on the path between the originating user and the WSE.

1 http://scroogle.org
2 http://www.anonymizer.com
User-Private Information Retrieval (UPIR) considered a community of users sharing common memory locations for writing queries, reading queries, and recording the WSE's answers [14]. Three flavors of UPIR (one-to-one, all-to-all, and configuration-based protocols) were introduced over time; each had weaknesses and was unable to provide privacy. An optimal configuration of the peer-to-peer network using a combinatorial (v, b, r, k)-configuration design was presented to achieve privacy [15], but [18] managed to compromise that technique with an intersection attack and algebraic functions. The Useless User Profile (UUP) protocol [17] is yet another distributed concept, involving the users, a central node, and the WSE. The authors of [7] showed that UUP is insecure in the presence of even a single malicious user: they compromised UUP's privacy with four types of attack and then modified UUP with double encryption, achieving higher privacy at the cost of considerable delay. However, [19] attacked [7] and concluded that the privacy of that technique can be compromised at the verification stage.
The use of existing social networks for private web search was investigated by [21, 24]. Those approaches achieved privacy from the WSE, but queries remained visible among the social-network peers: since each user forwards an unencrypted query, multiple collaborating users could find the query initiator through a predecessor attack [20]. The authors of [22] updated the UUP concept and made it secure in the presence of untrusted partners by using an Optimized Arbitrary Size (OAS) Benes network and ElGamal group-key encryption; queries are shuffled so that no one can link any query to its user. PEP and DISPEP, zero-knowledge-proof techniques, are used for proving privacy. The authors of [22] used a group size of three; group sizes of four and five users introduce significant delay due to the complex shuffling. They achieved considerable privacy on the local side, but the results are broadcast in clear text, so everyone knows what is being searched in a group, and the WSE can still manage to find the user through an intersection attack.
B. Ethical Issues:
In any distributed-system protocol, each user forwards other users' queries. This is an automated process; users are not required to check queries manually. In the worst case, a user may forward a dangerous query of some other user, and an innocent user may be caught over a query he never made. Such situations could be avoided if each query were filtered before being forwarded to the WSE; however, filtering each query and classifying it as innocent or dangerous is another field of research and beyond the scope of this paper. The authors of [24] raised the same issue and provided a liability mechanism to track down the originator of a dangerous query, but such a mechanism puts an extra burden on the system and degrades its performance. A comprehensive solution to this problem will be required in the future. This work does not primarily consider ethical issues; it is assumed that all users make benign queries and no one forwards a dangerous query.
C. Plan of this paper:
This work proposes a new protocol, "Poshida, A Protocol for Private Information Retrieval". Poshida aims to enhance the privacy of a user, both locally and against the WSE, by obfuscating the user profile. Poshida's first goal is to privately retrieve information from the WSE while concealing the user's identity; the second is to evaluate the privacy a user preserves against the peer nodes, the central server, and the WSE. Poshida is initially simulated with real data extracted from the AOL query-log dataset for 100 users. The privacy of a user is evaluated with the Profile Exposure Level (PEL) metric suggested by [21, 22, 24], which estimates the privacy level a user achieves against the WSE while executing the proposed protocol.

Section II describes the background and notation, Section III explains the system model, Section IV gives a Poshida overview, Section V analyzes the privacy of the proposed protocol, Section VI presents results and discussion, and Section VII concludes with future work.
II. BACKGROUND AND NOTATIONS
A. RSA Algorithm:
RSA is an asymmetric cryptographic algorithm, introduced in 1977, that uses two different keys for encryption and decryption [25]. RSA performs its cryptographic operation in three steps.

Key Generation: the following steps generate the public and private keys.
1. Choose two random large prime numbers x, y of similar length.
2. Calculate n = x * y.
3. Compute Euler's phi function Ø(n) = (x-1) * (y-1).
4. Pick an integer e such that 1 < e < Ø(n) and GCD(e, Ø(n)) = 1.
5. Public key: the pair (n, e) forms the RSA public key and is made public.
6. Private key: d is computed from x, y, and e, where d is the multiplicative inverse of e mod Ø(n), i.e. d = e^-1 (mod Ø(n)); the private key (d, n) is kept private.
Encryption: to encrypt a plaintext P, it is first represented as a series of numbers using an agreed-upon reversible protocol known as a padding scheme. Encryption is then the mathematical step of eq. (1):

C = P^e mod n    (1)
Decryption: RSA decryption is simple: the ciphertext C is raised to the power of the private key d modulo n to recover the plaintext:

P = C^d mod n    (2)
B. Random Selection Method:
Each user downloads a list of all connected users from the CS. During the query-shuffling process, the query-originating user selects another user uniformly at random from the list and forwards his query to him. The probability of selecting any given user is the same, i.e. 1/n, where n is the total number of users.
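A minimal sketch of this uniform selection, assuming the user list downloaded from the CS is a list of identifiers and that the selecting user excludes itself:

```python
import random

def pick_peer(connected_users, me):
    """Pick a forwarding peer uniformly at random from the list
    downloaded from the CS, excluding the selecting user itself."""
    candidates = [u for u in connected_users if u != me]
    return random.choice(candidates)  # each peer chosen with equal probability

users = ["U%d" % i for i in range(1, 11)]
peer = pick_peer(users, "U1")
```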
C. Privacy Evaluation:
The authors of [21, 24] introduced the term Profile Exposure Level (PEL) to measure the privacy a user achieves against the WSE. PEL uses mutual information and entropy to measure the level of user-profile exposure:

PEL = I(M, N) / H(M) * 100    (3)

where M represents the set of categories of the queries the user actually generates and N represents the set of categories of the queries the user sends to the WSE; N also contains the categories of other users' queries. H(M) is the entropy of M:

H(M) = -Σ_m p(m) log2 p(m)    (4)

I(M, N) is the mutual information:

I(M, N) = H(M) - H(M|N)    (5)

I(M, N) = Σ_{m,n} p(m, n) log2( p(m, n) / (p(m) p(n)) )    (6)

H(M|N) is the conditional entropy; p(m) and p(n) are the probabilities of each element of M and N, proportional to its count. In this work, PEL is used to measure the percentage of information exposed when a user forwards other users' queries.
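A small sketch of the PEL computation in eqs. (3)-(6), estimating the distributions from paired (real category, sent category) observations; the example data below are hypothetical, not drawn from the AOL experiment.

```python
from collections import Counter
from math import log2

def pel(pairs):
    """Profile Exposure Level, eq. (3): PEL = I(M, N) / H(M) * 100.
    `pairs` holds (real_category, sent_category) observations from
    which the marginal and joint probabilities are estimated."""
    total = len(pairs)
    joint = Counter(pairs)                     # p(m, n)
    pm = Counter(m for m, _ in pairs)          # p(m)
    pn = Counter(n for _, n in pairs)          # p(n)
    h_m = -sum((c / total) * log2(c / total)   # eq. (4)
               for c in pm.values())
    i_mn = sum((c / total) *                   # eq. (6)
               log2((c / total) / ((pm[m] / total) * (pn[n] / total)))
               for (m, n), c in joint.items())
    return 100.0 * i_mn / h_m

# If the sent categories mirror the real ones exactly, the profile
# is fully exposed and PEL is 100.
exposed = pel([("Arts", "Arts"), ("Sports", "Sports"),
               ("Arts", "Arts"), ("News", "News")])
```

A lower PEL means that observing the forwarded queries reveals less about the user's real profile.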
III. SYSTEM MODEL
This section defines the entities that participate in Poshida. The Poshida architecture has the following entities:
i. User
ii. Central Server
iii. Query Forwarding Node
iv. Web Search Engine
A. User:
The person who actually makes a web search query. We have honest, curious, and dishonest users. An honest user's aim is to protect his or her privacy, whereas curious and dishonest users are motivated to find the queries of honest users.
B. Central Server (CS):
A server that supervises users' connections to the system and provides each user with the list of all connected users. The CS performs the selection and announcement of the Query Forwarding Node (QFN). When a user (node) connects to the CS, the CS records the user's IP address and port number and distributes this information to all connected users.
C. Query Forwarding Node (QFN):
A user responsible for forwarding queries to the WSE, downloading the results from the WSE, encrypting the results, and broadcasting the encrypted results to all members of the group.
D. Web Search Engine:
WSEs such as Google, Bing, Yahoo, and AOL are software systems that retrieve data from the World Wide Web. When a WSE receives a query, it records user details such as IP address, OS details, browser type, language, and preferences, and builds the user's profile so that it can return personalized results. The WSE has no intent to protect the privacy of the user.
IV. POSHIDA OVERVIEW
The objective of Poshida is to obfuscate the profiles of users submitting queries to the WSE. Initially, all users who want to submit queries to the WSE are required to connect to the CS. The CS selects one user as QFN from the list of connected users, from top to bottom on a first-come-first-served basis, for a session. During that session the QFN is expected to forward the queries of all users at least once, after which the QFN is changed. Poshida's execution is explained below.

Poshida Description:
Poshida is executed in five sequential steps. The protocol is implemented as a Java applet program consisting of client-side and server-side software.
1) Connection setup:
The client-side software lets a user U connect to the CS, which records the user's details (IP address and port number). The CS provides the list of connected users, i.e. {U1, U2, ..., Un}, to all other users.
2) Query Forwarding Node (QFN):
The CS selects each user as QFN, one by one on a first-come-first-served basis, for a session, and informs that user that he has been selected as QFN. The QFN generates his public key and gives it to the CS, which announces the QFN's IP address, port number, and public key. The QFN's details are made available to every client.
3) Query sending process:
When a node wants to send a query, it encrypts the query with the public key of the QFN and attaches its own public key for later use. The user then generates a random number X between 1 and 100; X is appended to the encrypted query packet and remains visible.
4) Query shuffling process:
The user randomly selects a peer from the list of all connected users (each peer with equal probability) and forwards the encrypted query packet to him. Upon receiving the packet, the peer checks whether the appended X is even or odd. If X is even, the peer forwards the packet to the QFN; if X is odd, the peer generates another random number, replaces the previous X, and forwards the packet to another randomly selected peer. The process repeats until X is found to be even, so after a few passes the encrypted query packet reaches the QFN.
5) Query forwarding to the Web Search Engine:
When the packet arrives at the QFN, the QFN decrypts it with its private key and sets aside the public key of the query-originating user. The QFN sends the query to the WSE, downloads the results retrieved by the WSE, encrypts them with the public key of the requesting user, and broadcasts the encrypted result to all users. Only the intended user, who holds the matching private key, is able to decrypt the result. The process is shown in Fig. 1.
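The even/odd walk in steps 3 and 4 can be simulated to see how many peer-to-peer passes a packet typically takes before reaching the QFN. This sketch abstracts away the encryption and treats each draw of X as an independent draw from 1..100, which is an assumption for illustration:

```python
import random

def passes_to_qfn(rng):
    """Count peer-to-peer passes of one query packet: the current
    holder draws X in 1..100; an even X sends the packet to the QFN,
    an odd X replaces X and forwards to another random peer."""
    passes = 1                           # originator always forwards once
    while rng.randint(1, 100) % 2 == 1:  # odd X: pass to another peer
        passes += 1
    return passes

rng = random.Random(42)
trials = [passes_to_qfn(rng) for _ in range(10_000)]
avg = sum(trials) / len(trials)
# P(even X) = 1/2, so the pass count is geometric with mean 2.
```

Under this model the expected number of passes is 2 and long walks are exponentially unlikely, although there is no hard upper bound on path length.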
Key generation and encryption:
A KeyPairGenerator class instance is used to generate the key pairs, with a key length of 1024 bits. Each user generates his own key pair and is responsible for generating a new pair every time he connects to the system.
V. PRIVACY OF PROPOSED PROTOCOL
The privacy of a user running Poshida is investigated against the following dishonest entities:

A. Against peer users:
Peer users cannot see the user's query because it is encrypted with the public key of the QFN. When a user receives a query packet with a random number, he cannot link it to a user, because the receiver cannot be certain whether the node that forwarded the packet is the query originator or merely a forwarder. The query is encrypted under the public key of the QFN, so no one but the QFN can decrypt it. When the QFN downloads the result for the query, it encrypts the result with the public key of the querying user before broadcasting it; hence, no one but the query originator can see the results. Thus both the query and the results remain hidden from the rest of the peers. A dishonest user can only link a query to a user if the other n-1 users in the query-sending process are compromised.
B. Against the Central Server:
The CS is a dedicated computer that only gives out information about the QFN and the details of all connected users; it takes no part in query forwarding or query shuffling. Even if the CS and the QFN are compromised, malicious, or curious nodes, they cannot link any query to a user, because at the shuffling stage the user selects a peer uniformly at random from the list of available users. The probability of linking a query to a user is therefore 1/n, where n is the number of honest users. In this protocol all users are placed in a single group.
C. Against the QFN:
The QFN can see the query, but the query remains unlinkable to the user: the QFN cannot be certain who actually made it.
D. Against the WSE:
The WSE receives the queries from the QFN and considers the QFN to be the query originator, although it is in fact only a forwarding user. The QFN forwards the queries of all online users, with their miscellany of interests; hence, after executing Poshida, the users' profiles are highly obfuscated. In previous systems [17, 21, 22, 24], each user forwarded other users' queries to the WSE but never his own, so the WSE could be certain that queries forwarded by a user were not his own, which increased the risk of a privacy breach. In Poshida, however, the QFN may forward his own query along with all the other nodes' queries, which increases privacy.

To evaluate the level of privacy a user achieves by executing Poshida, a test is performed to see how much of the profile is obfuscated. The test compares the case where users run Poshida and submit queries through the QFN with the case where they submit their queries to the WSE directly. Let P represent the user's original profile without the protocol, and P′ the obfuscated profile, i.e. the profile observed when the user uses the protocol. The test is performed on the AOL query dataset [23] to find the difference between P and P′. The first step is to extract a user's profile from the AOL queries. The authors of [22] proposed two steps for the query-categorization task: morpho-syntactic analysis and semantic analysis of the queries. In the first step, natural language processing (NLP) techniques (sentence detection, tokenization, part-of-speech tagging, syntactic parsing, stop-word removal, and stemming) are applied to obtain the main topic of the query, as detailed in [27]. The terms obtained in this step are then sent to the ODP (dmoz.org) to get the category of the query topic. The ODP is the largest human-edited web directory, maintained by a community of volunteers [28]. When queries are sent to dmoz.org [28], it classifies each query into a category, as shown in Table 1. Consider the query "James Bond": it is found in the ODP directory at "Arts: Movies: Titles: J: James Bond Series: Fan Pages", so at the first degree it is "Arts", at the second "Movies", and so on; the user profile then contains Arts, Movies, James Bond, etc. Applying these two techniques, the user's queries are classified into categories and the user profile is extracted. We then have the categories for the user's original profile P and the obfuscated profile P′. To measure the difference between P and P′, the Profile Exposure Level (PEL) given in (3) is used; the PEL calculation is detailed in [21, 22, 24].
TABLE 1. EXAMPLE OF QUERY CLASSIFICATION BY ODP

Query         ODP classification at different degrees
ICC           Sports: Cricket: ICC: Events: World Cup
Harry Potter  Kids and Teens: Entertainment: Movies: Titles: Harry Potter
Qatar         Regional: Middle East: Qatar: Government
James Bond    Arts: Movies: Titles: J: James Bond Series: Fan Pages
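Splitting an ODP path into its degrees, as in Table 1, is a simple string operation. The following minimal helper is ours, for illustration; the paper does not prescribe an implementation:

```python
def odp_degrees(path, max_degree=4):
    """Split an ODP classification path such as
    'Arts: Movies: Titles: J: James Bond Series: Fan Pages'
    into its first `max_degree` levels."""
    levels = [part.strip() for part in path.split(":") if part.strip()]
    return levels[:max_degree]

bond = odp_degrees("Arts: Movies: Titles: J: James Bond Series: Fan Pages")
icc = odp_degrees("Sports: Cricket: ICC: Events: World Cup:", max_degree=2)
```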
E. Privacy parameter selection:
The authors of [22] argued that the privacy of a user increases with the size of the group. Poshida puts all users into a single common group, so user profiles are significantly obfuscated, achieving strong privacy. The test is performed with the real AOL query set: a subset of 100 users with their queries is randomly selected from the AOL dataset. In the experiments, first 10 users join the CS and execute Poshida, then 20 users, and finally 30 users. To cover the 100 users, Poshida is iterated 10 times when 10 users join the server, five times for 20 users, and three times for 30 users. The resulting artificial query log is captured, the profile each user achieves is calculated using the steps of the previous section, and the simulated profile is compared with the original profile using Eq. (3) to find the percentage of obfuscation.
Fig. 1. Step-by-step description of the protocol.
VI. RESULTS AND DISCUSSION
The results obtained are summarized in Table 2, which shows the average Profile Exposure Level for different ODP degrees. These percentages denote how much of the real profile is disclosed by observing the obfuscated profile. The ODP category at the second degree provides a sufficient and consistent level of specificity [28] for evaluating the profile; however, we tested four degrees to get a more detailed view of privacy. Degree 1 represents the most general category of a query topic; as we move to the second, third, and fourth degrees, the query topic becomes increasingly specific and Poshida achieves higher privacy. At degree 2, more than 70% of the profile remains hidden; at degree 3, this rises to more than 80% for 20 users simulating together and about 77% for 10 and 30 users; and at degree 4, more than 85% of the profile remains hidden.
VII. CONCLUSION AND FUTURE WORK
A WSE builds a user profile to provide personalized search results over the queries it receives from the user. However, web queries sometimes contain sensitive information that threatens the user's privacy. This work focused on that privacy problem, and a novel protocol, Poshida, was proposed to achieve privacy. Poshida was initially simulated for 100 users in a single group, with 10, 20, and 30 users running the protocol; the results are quite encouraging.
Fig. 2. Profile Exposure Level at different degrees.
TABLE 2: AVERAGE PEL AT DIFFERENT DEGREES

Users   Degree 1   Degree 2   Degree 3   Degree 4
10      40.09      27.50      23.04      13.61
20      53.74      27.18      19.39      12.98
30      55.06      28.66      22.85      14.33
In the future, we will evaluate Poshida's performance in terms of the delay it introduces in query answering. The query-shuffling phase is based on finding an even number; a better technique will be required to bound the maximum number of passes a packet may take before reaching the QFN. Poshida will also be simulated with higher numbers of users, i.e. 1000 and 2000 users, to check its performance and scalability.
References
[1] Renan Cattelan , Darko Kirovski, Towards
improving the online shopping experience: A
client-based platform for post-processing Web
search results, Web Intelligence and Agent
Systems, v.10 n.2, p.209-231, April 2012
[2] Cooper, Alissa. "A survey of query log privacy-
enhancing techniques from a policy perspective."
ACM Transactions on the Web (TWEB) 2.4
(2008): 19.
[3] Hannak, Aniko, et al. "Measuring
personalization of web search. "Proceedings of
the 22nd international conference on World Wide
Web. International World Wide Web
Conferences Steering Committee, 2013.
[4] E. Steel, A web pioneer profiles users by name,
Wall Street J (2010).
[5] Fung, Benjamin, et al. "Privacy-preserving data
publishing: A survey of recent
developments." ACM Computing Surveys
(CSUR) 42.4 (2010): 14.
[6] Saint-Jean, Felipe, et al. "Private web search."
Proceedings of the 2007 ACM workshop on
Privacy in electronic society. ACM, 2007.
[7] Lindell, Yehuda, and Erez Waisbard. "Private
web search with malicious adversaries." Privacy
Enhancing Technologies. Springer Berlin
Heidelberg, 2010.
[8] Arampatzis, Avi, George Drosatos, and Pavlos S.
Efraimidis. "Versatile Query Scrambling for
Private Web Search." Information Retrieval
Journal 18.4 (2015): 331-358.
[9] Barbaro, Michael, Tom Zeller, and Saul Hansell. "A face is exposed for AOL searcher no. 4417749." New York Times, 9 August 2006.
[10] R. Dingledine, N. Mathewson, P. Syverson, Tor: the second-generation onion router, in: Proceedings of the 13th Conference on USENIX Security Symposium, 2004, pp. 21-31.
[11] Purcell, Kristin, Joanna Brenner, and Lee Rainie.
"Search engine use 2012." (2012).
[12] B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan, Private information retrieval, Journal of the ACM 45 (1998) 965-981.
[13] Reiter, M. K., & Rubin, A. D. Crowds:
Anonymity for web transactions. ACM
Transactions on Information and System Security
(TISSEC) (1998), 1(1), 66-92.
[14] Domingo-Ferrer, Josep, et al. "User-private
information retrieval based on a peer-to-peer
community." Data & Knowledge Engineering
68.11 (2009): 1237-1252.
[15] Stokes, Klara, and Maria Bras-Amoros. "Optimal
configurations for peer-to-peer user-private
information retrieval." Computers & mathematics
with applications 59.4 (2010): 1568-1577.
[16] Swanson, Colleen M., and Douglas R. Stinson.
"Extended combinatorial constructions for peer-
to-peer user-private information retrieval." arXiv
preprint arXiv:1112.2762 (2011).
[17] Castellà-Roca, Jordi, Alexandre Viejo, and Jordi
Herrera-Joancomartí. "Preserving user’s privacy
in web search engines." Computer
Communications 32.13 (2009): 1541-1551
[18] Swanson, Colleen M., and Douglas R. Stinson.
"Extended combinatorial constructions for peer-
to-peer user-private information retrieval." arXiv
preprint arXiv:1112.2762 (2011).
[19] Cao, Zhengjun, Lihua Liu, and Zhenzhen Yan. "An Improved Lindell-Waisbard Private Web Search Scheme." International Journal of Network Security 18.3: 538-543.
[20] Wright, Matthew K., et al. "The predecessor
attack: An analysis of a threat to anonymous
communications systems." ACM Transactions on
Information and System Security (TISSEC) 7.4
(2004): 489-522.
[21] Erola, Arnau, et al. "Exploiting social networks to
provide privacy in personalized web search."
Journal of Systems and Software 84.10 (2011):
1734-1745.
[22] Romero-Tris, Cristina, Jordi Castella-Roca, and Alexandre Viejo. "Distributed system for private web search with untrusted partners." Computer Networks 67 (2014): 26-42.
[23] ElGamal, Taher. "A public key cryptosystem and
a signature scheme based on discrete logarithms."
Advances in cryptology. Springer Berlin
Heidelberg, 1985.
[24] Viejo, Alexandre, and Jordi Castellà-Roca.
"Using social networks to distort users’ profiles
generated by web search engines." Computer
Networks 54.9 (2010): 1343-1357.
[25] Rivest, R. L., Shamir, A., & Adleman, L. (1978).
A method for obtaining digital signatures and
public-key cryptosystems. Communications of
the ACM, 21(2), 120-126.
[26] Gervais, Arthur, et al. "Quantifying web-search
privacy." Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications
Security. ACM, 2014.
[27] C.D. Manning, H. Schütze, Foundations of
Statistical Natural Language Processing, MIT
Press, Cambridge, MA, USA, 1999.
[28] ODP, Open Directory Project, 2013.
<http://www.dmoz.org/>.
[29] C. Eickhoff, K. Collins-Thompson, P. Bennett, S. Dumais, Designing human-readable user profiles for search evaluation, in: Proceedings of the 35th European Conference on Advances in Information Retrieval, ECIR'13, Springer-Verlag, Berlin, Heidelberg, 2013, pp. 701-705.
... These techniques are proposed to tackle the user privacy infringement problem . There are numerous techniques available to counter privacy infringement, such as proxy networks (Berthold, Federrath, & Köpsell, 2001;Mokhtar et al., 2017), profile obfuscation techniques (Nissenbaum & Daniel, 2009), query scrambling techniques (Arampatzis, Drosatos, & Efraimidis, 2015;Arampatzis, Efraimidis, & Drosatos, 2013), private information retrieval protocols (Reiter & Rubin, 1998;Romero-Tris, Castella-Roca, & Viejo, 2011;Romero-Tris, Viejo, & Castellà-Roca, 2015;Ullah et al., 2019;Ullah et al., 2021;Ullah, Khan, & Islam, 2016a, 2016bUllah et al., 2022;Viejo, Castella-Roca, Bernadó, & Mateo-Sanz, 2012) and others (Chen, Bai, Shou, Chen, & Gao, 2011;Mokhtar, Berthou, Diarra, Quéma, & Shoker, 2013;Mokhtar et al., 2017;Petit, Cerqueus, Mokhtar, Brunie, & Kosch, 2015;Shapira, Elovici, Meshiach, & Kuflik, 2005). ...
... Here p represents a discrete random variable for the user of interest (UoI), and the probability of each keyword used by the UoI is represented as kj. This measure is used by PaOSLo (Ullah et al., 2022), Poshida I and II (Ullah et al., 2016a, 2016b), MG-OSLo, OSLo, OQF-PIR (Rebollo-Monedero, Forne, & Domingo-Ferrer, 2012), and Balsa's model (Balsa, Troncoso, & Diaz, 2012). ...
... The original user profile comprises the user's original queries, while the obfuscated profile is created through a privacy-preserving web search mechanism. This metric is used by many prevalent private information retrieval protocols such as PaOSLo (Ullah et al., 2022), MG-OSLo, OSLo, UUP (Juarez & Torra, 2015), Poshida-I (Ullah et al., 2016b), Poshida-II (Ullah et al., 2016a) and others. The mathematical formula of the Profile Exposure Level is shown in Equation 6. ...
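Equation 6 itself is not reproduced in this excerpt, so the following is only an illustrative proxy for a PEL-style comparison of an original and an obfuscated profile. The assumptions are mine: a profile is a keyword-to-probability map, and exposure is proxied by the share of the original profile's entropy carried by keywords that also appear in the profile the WSE observes.

```python
import math

# Hypothetical PEL-style sketch (assumption: exposure = fraction of the
# original profile's Shannon entropy recoverable from shared keywords).
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def profile_exposure_level(original, observed):
    shared = {k: p for k, p in original.items() if k in observed}
    h = entropy(original)
    return 100.0 * entropy(shared) / h if h else 0.0
```

With `original = {"a": 0.5, "b": 0.25, "c": 0.25}` and an observed profile containing only keyword `"a"`, this proxy reports roughly 33% exposure.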
Chapter
Due to the exponential growth of information on the internet, web search engines (WSEs) have become indispensable for effectively retrieving information. Web search engines store users' profiles to provide the most relevant results. However, user profiles may contain sensitive information, including the user's age, gender, health condition, personal interests, and religious or political affiliation. This raises serious concerns about user privacy, since a user's identity may be exposed and misused by third parties. Researchers have proposed several techniques to address the issue of privacy infringement while using a WSE, such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols. In this chapter, the authors give a brief survey of privacy attacks and the evaluation models used to assess the performance of private web search techniques.
... These techniques include query scrambling [6], profile obfuscation [7], proxy services [8], and Private Information Retrieval (PIR) protocols [9-12]. In the query scrambling technique, the user's query is transformed into a set of diverse sub-queries that are later posted to the WSE, while in the profile obfuscation technique, the user's query is posted to the WSE together with fake queries. ...
... In the proxy-based approach, the user submits his/her query to the WSE through a proxy server, whereas in PIR protocols a group of users submit queries on behalf of each other to hide their identities. According to the literature, PIR protocols provide better privacy to WSE users than other techniques [1-13]. Some studies indicate that PIR protocols are vulnerable to machine learning attacks [13, 14], especially the QuPiD Attack [1, 3]. ...
... Moreover, we used the Topic Score feature vector for training and testing of the QuPiD Attack. The Topic Score feature vector comprises a set of numeric values for 10 major topics acquired from the uClassify service [1-13]. According to the results, the QuPiD Attack associated 40% of anonymized queries with the correct user with 70% precision. ...
Article
Full-text available
Web search engines usually keep users' profiles for multiple purposes, such as result ranking and relevancy, market research, and targeted advertisements. However, a user's web search history may contain sensitive and private information, such as health condition, personal interests, and affiliations, which may infringe on the user's privacy since the user's identity may be exposed and misused by third parties. Numerous techniques are available to address privacy infringement, including Private Information Retrieval (PIR) protocols that use peer nodes to preserve privacy. Previously, we proved that PIR protocols are vulnerable to the QuPiD Attack. In this research, we propose the NN-QuPiD Attack, an improved version of the QuPiD Attack that uses a Recurrent Neural Network (RNN)-based model to associate queries with their original users. The results show that the NN-QuPiD Attack achieved a recall of 0.512 with a precision of 0.923, whereas the simple QuPiD Attack achieved a recall of 0.49 with a precision of 0.934 on the same data.
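The precision and recall figures above score how well an attack attributes queries to users. A minimal sketch of that scoring, under my own assumption that the attack outputs a query-to-predicted-user map (with `None` meaning "no attribution"), could look like:

```python
# Hypothetical scorer for a query-attribution attack.
# predictions: {query: predicted_user or None}; truth: {query: actual_user}.
def precision_recall(predictions, truth):
    attributed = {q: u for q, u in predictions.items() if u is not None}
    correct = sum(1 for q, u in attributed.items() if truth.get(q) == u)
    precision = correct / len(attributed) if attributed else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall
```

Precision is computed only over the queries the attack dared to attribute, while recall is over the whole ground truth, which is why an attack can trade one off against the other.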
... However, the WSE can easily recognize a TOR member through device fingerprinting; moreover, TOR users are vulnerable to global eavesdropping and to active and passive adversaries [21]. Another approach is distributed schemes, which function through the collaboration of numerous users, where each user forwards other users' queries to hide the identity and obfuscate the profile of a user [9, 13, 22-28]. Figure 1 shows the schemes proposed in private web search. The privacy of a user in a private web search is considered preserved if (i) the content of the user's query and the result returned by the WSE remain hidden from the group peers, (ii) the query content cannot be linked to its originator, and (iii) the WSE is unable to build an accurate profile of the user. ...
... Authors in Refs. [9, 22, 23] have used the RSA and ElGamal shared-key encryption schemes. In this work, we use the RSA encryption algorithm to achieve confidentiality; RSA is an asymmetric encryption scheme, so its public key is easy to share. ...
... Authors in Refs. [9, 22, 23, 27] used PEL as a privacy evaluation metric to evaluate the privacy achieved by the user against the WSE. The PEL applies mutual information and entropy to compute the degree of user profile exposure. ...
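The point about RSA's easily shared public key can be illustrated with a toy "textbook" RSA round trip. This is a deliberately insecure sketch (tiny fixed primes, no padding) meant only to show that encryption needs the public pair (e, n) while decryption needs the private d.

```python
# Toy textbook RSA (NOT secure: tiny primes, no padding) to illustrate
# why asymmetric keys are easy to distribute: only (e, n) is published.

def egcd(a, b):
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def modinv(a, m):
    g, x, _ = egcd(a, m)
    assert g == 1, "a must be coprime with m"
    return x % m

def make_keys(p=61, q=53, e=17):
    n = p * q
    phi = (p - 1) * (q - 1)
    d = modinv(e, phi)
    return (e, n), (d, n)   # (public key, private key)

def encrypt(m, pub):
    e, n = pub
    return pow(m, e, n)

def decrypt(c, priv):
    d, n = priv
    return pow(c, d, n)
```

In a group protocol, each peer would publish its (e, n) so others can encrypt queries for it, while d never leaves the peer.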
Article
Full-text available
Users around the world send queries to Web Search Engines (WSEs) to retrieve data from the Internet. Users usually seek primary assistance concerning medical information from a WSE via search queries. Search queries relating to diseases and treatments are considered among the most personal facts about a user. Such queries often contain identifiable information that can be linked back to the originator, which can compromise a user's privacy. In this work, we propose a distributed privacy-preserving protocol (OSLo) that eliminates limitations in the existing distributed privacy-preserving protocols, along with a framework that evaluates the privacy of a user. The OSLo framework assesses local privacy relative to the group of users involved in forwarding queries to the WSE, and profile privacy against profiling by the WSE. The privacy analysis shows that the local privacy of a user depends directly on the size of the group and inversely on the number of compromised users. We have performed experiments to evaluate the profile privacy of a user using the privacy metric Profile Exposure Level. OSLo is simulated with a subset of 1000 users of the AOL query log. The results show that OSLo performs better than the benchmark privacy-preserving protocol in terms of privacy and delay. Additionally, the results show that the privacy of a user depends on the size of the group.
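The stated relationship between local privacy, group size, and compromised users can be illustrated with a toy model. The 1/(n - c) form below is my own simplifying assumption for illustration (a query forwarded at random is equally likely to come from any non-colluding member), not the paper's exact analysis.

```python
# Toy linkability model (assumption: with n group members of which c
# collude, a forwarded query is pinned to its originator with
# probability 1 / (n - c) from the colluders' point of view).
def link_probability(group_size, compromised):
    honest = group_size - compromised
    if honest <= 0:
        return 1.0      # everyone else colludes: the originator is exposed
    return 1.0 / honest
```

Under this toy model, growing the group from 5 to 10 honest members halves the linkability, matching the qualitative claim that privacy improves with group size.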
... In the profile obfuscation technique [13-21], fake queries are forwarded with the user's query in order to mislead the WSE. In PIR protocols [22-31], a group of users exchange their queries with each other and submit them to the WSE, while hybrid techniques [32, 33] combine more than one of the aforementioned techniques. ...
... Similarly, in order to avoid the problem of adverse users in the group, Viejo et al. [42] suggested involving a user's social media friends in group creation. Ullah et al. proposed Poshida [26] (single-group) and Poshida II [27] (multi-group) PIR protocols, which use the concept of a query forwarding node (QFN) to improve privacy. ...
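The query-exchange step common to these PIR protocols can be sketched as a random derangement: every member ends up submitting exactly one query, and nobody submits their own, so the WSE cannot tell submitter from originator. The honest-peer assumption and the rejection-sampling derangement are mine, for illustration only.

```python
import random

# Toy sketch of group query exchange (assumptions: honest peers; a
# derangement so no member submits their own query to the WSE).
def exchange(queries):
    users = list(queries)
    while True:
        owners = random.sample(users, len(users))   # random permutation
        if all(u != o for u, o in zip(users, owners)):   # derangement found
            return {u: queries[o] for u, o in zip(users, owners)}
```

Calling `exchange({"alice": "q1", "bob": "q2", "carol": "q3"})` yields an assignment where all three queries are still submitted, but never by their owners.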
Article
Full-text available
The increasing use of web search engines (WSEs) for searching healthcare information has resulted in a growing number of users posting personal health information online. A recent survey demonstrates that over 80% of patients use a WSE to seek health information. However, the WSE stores these users' queries to analyze user behavior and for result ranking, personalization, targeted advertisements, and other activities. Health-related queries contain privacy-sensitive information that may infringe users' privacy, since a user's identity may be exposed and misused by third parties. Therefore, privacy-preserving web search techniques such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols are used to ensure the user's privacy. In this paper, we propose the Privacy Exposure Measure (PEM), a technique that enables a user to control his/her privacy exposure while using PIR protocols. PEM assesses the similarity between the user's profile and a query before posting it to the WSE and assists the user in avoiding privacy exposure. The experiments demonstrate a 37.2% difference between user profiles created through a PEM-powered PIR protocol and the usual user profiles. Moreover, PEM offers more privacy to the user even in the case of a machine-learning attack.
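A PEM-style "similarity before posting" check might be sketched as follows. The representation (topic-probability vectors over a shared topic set), the cosine measure, and the 0.5 threshold are all my assumptions for illustration; the paper's actual similarity function is not given in this excerpt.

```python
import math

# Hypothetical PEM-style pre-posting check (assumptions: profiles and
# queries are vectors over the same topic set; cosine similarity).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def exposes_profile(profile_vec, query_vec, threshold=0.5):
    # A query too similar to the stored profile would reinforce it,
    # so the user may route it through the PIR group instead.
    return cosine(profile_vec, query_vec) >= threshold
```

A query orthogonal to the profile passes straight through; a query closely matching the profile triggers the protective path.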
... Distributed Systems - These approaches necessitate the cooperation of a collective of users working together to safeguard their privacy, effectively concealing their actions within the activities of numerous others (Reiter and Rubin, 1998; Castellà-Roca et al., 2009; Lindell and Waisbard, 2010; Viejo and Castellà-Roca, 2010; Erola et al., 2011b; Romero-Tris et al., 2011b; Ullah et al., 2016a, 2016b). Typically, these techniques place users into a large group where they submit requests on behalf of other group members, exchanging their queries. ...
Article
Internet-based services process and store numerous search queries around the globe. The use of web search engines, such as Bing and Google, as well as personal assistants (e.g., Alexa and Cortana) and task-specific systems (e.g., YouTube, Netflix, Amazon), are relevant examples. The queries associated with such services may be stored and sold for profit. Before doing so, personal and sensitive information must be sanitized, as required by current regulations. This can be cumbersome for some organizations. We present an automated solution for anonymizing unstructured data, like that used within query logs. Our solution uses a lightweight probabilistic k-anonymity approach, which allows verifiable real-time privacy protection. It addresses previous limitations and improves performance. We validate the feasibility of the approach under several evaluation metrics, including data utility, privacy, and speed.
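The core k-anonymity property behind such sanitization can be sketched as a simple check. The log representation below (a list of quasi-identifier/query pairs) is my assumption; the paper's probabilistic variant relaxes this strict per-group count.

```python
from collections import Counter

# Minimal k-anonymity check over a query log (assumption: a release is
# k-anonymous if every quasi-identifier value appears at least k times,
# so no record's quasi-identifier singles out fewer than k individuals).
def is_k_anonymous(log, k):
    counts = Counter(qi for qi, _query in log)
    return all(c >= k for c in counts.values())
```

A log in which one quasi-identifier appears only once fails the check for k = 2, which is exactly the situation the AOL release ran into.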
... In the dynamic group category, researchers introduced protocols consisting of entities such as a user, a central server (CS), and a WSE [6], [14], [24]. These protocols first create a group after receiving connection requests from users. ...
Article
Full-text available
The Web Search Engine (WSE) is a software system used to retrieve data from the web. The WSE uses users' search queries to build user profiles and provide personalized results. Users' search queries hold identifiable information that could compromise the privacy of the respective user. This work proposes a multi-group distributed privacy-preserving protocol (MG-OSLo) and investigates the state-of-the-art distributed privacy-preserving protocols for computing web search privacy. MG-OSLo comprises multiple groups, each with a fixed number of users, and measures the impact of the multi-group design on the user's privacy. The primary objective of this work is to assess local privacy and profile privacy, evaluating the impact of group size and group count on a user's privacy. Two grouping approaches are used to group the users in MG-OSLo: a non-overlapping group design and an overlapping group design. The local privacy results reveal that the probability of linking a query to the user depends on the group size and group count: the higher the group size or group count, the lower the likelihood of relating the query to the user. Profile privacy computes the profile obfuscation level using the privacy metric Profile Exposure Level (PEL). Different experiments have been performed to compute the profile privacy of a subset of the AOL query log for two situations: (i) self-query submissions allowed and (ii) self-query submissions not allowed. The privacy achieved by MG-OSLo is compared with the modern privacy-preserving protocols UUP(e), OSLo, and Co-utile. The results show that MG-OSLo provides better results than OSLo, UUP, and Co-utile. Similarly, the multi-group design has a positive impact on local privacy and user profile privacy.
... Second, a PIR algorithm is executed to privately obtain all the POI records in the designated public cell. Recently, Ullah et al. introduced Poshida, a private information retrieval protocol for WSE applications [204]. They propose to obfuscate the user profile collected and maintained by WSE. ...
Article
Personal data are often collected and processed in a decentralized fashion, within different contexts. For instance, with the emergence of distributed applications, several providers usually correlate their records and provide personalized services to their clients. Collected data include geographical and indoor positions of users and their movement patterns, as well as sensor-acquired data that may reveal users' physical conditions, habits, and interests. Consequently, this may lead to undesired consequences such as unsolicited advertisement and even discrimination and stalking. To mitigate privacy threats, several techniques have emerged, referred to as Privacy Enhancing Technologies (PETs). On one hand, the increasing pressure on service providers to protect users' privacy has resulted in PETs being adopted. On the other hand, service providers have built their business models on personalized services, e.g., targeted ads and news. The objective of the paper is then to identify which of the PETs have the potential to satisfy both of these usually divergent (economic and ethical) purposes. This paper presents a taxonomy classifying eight categories of PETs into three groups and, for better clarity, considers three categories of personalized services. After defining and presenting the main features of PETs with illustrative examples, the paper points out which PETs best fit each personalized service category. Then, it discusses some of the interdisciplinary privacy challenges that may slow down the adoption of these techniques, namely technical, social, legal, and economic concerns. Finally, it provides recommendations and highlights several research directions.
... An oblivious transfer algorithm (see page 104) is first used to project the user's location into a zone of the service's map. A PIR algorithm is then used to obtain all the points of interest present in that zone. More recently, in 2016, Ullah et al. [114] proposed Poshida, a PIR protocol for the web search engine application. The idea is to obfuscate the user profile collected and maintained by the search engine. ...
Book
Full-text available
Better understanding the choice of tools and strategies for preserving privacy. On one hand, end users are keen to benefit from targeted recommendations and promotional offers for a nearby service or product, films, books, etc. On the other hand, companies seek to retain their customers by improving their quality of experience, thereby justifying the collection of personal data that they also monetize. This is the current situation pushing companies to offer ever more personalized services based on the massive collection of personal data. This trend is not without risks, however: leaks of personal data, abuse in the exploitation of these data, and drift toward surveillance environments, to name only a few. In its latest research work, the VPIP Chair has examined this problem and established a taxonomy and survey of Privacy Enhancing Technologies (PETs). From these scientific results, entitled "Privacy Enhancing Technologies for solving the Privacy-Personalization Paradox: Taxonomy and Survey", an electronic book has been produced. Intended for readers who may be novice or advanced on these subjects, this book is offered in different reading formats. In a growing context of personalization of digital services, it allows everyone to situate themselves among the various choices of tools and strategies for preserving privacy.
Thus, eight privacy-preserving technologies, which can be attached to three distinct groups, are reviewed and explained simply: • User-oriented techniques, which require the user to manage the protection of their identity themselves by installing specific software, from certification to the controlled disclosure of the attributes describing them; • Server-oriented techniques, which require data-processing tools capable of anonymizing data and performing computations on encrypted data; • Communication-channel-oriented techniques, which highlight the characteristics of the channel, such as data encryption and the use of intermediate servers.
Chapter
The web search engine (WSE) is an indispensable software system used by people worldwide to retrieve data from the web using keywords called queries. The WSE stores search queries to build the user's profile and provide personalized results. User search queries often hold identifiable information that could compromise the user's privacy. Preserving privacy in web searches is a primary concern for users from various backgrounds. Many techniques have been proposed over time to preserve a person's web search privacy. Some techniques preserve an individual's privacy by obfuscating the user's profile, sending fictitious queries along with the original ones. Others hide the user's identity and preserve privacy through unlinkability. A distributed technique, however, preserves privacy by providing both unlinkability and obfuscation. In distributed protocols, a group of users collaborate to forward each other's queries to the WSE, providing unlinkability and obfuscation. This work presents a survey of distributed privacy-preserving protocols; their benefits, limitations, and evaluation parameters are detailed in this work.
Chapter
Privacy quantification methods are used to quantify the knowledge an adversarial search engine has obtained with and without privacy protection mechanisms; thus, these methods calculate privacy exposure. Private web search techniques are based on many methods (e.g., proxy services, query modification, query exchange, and others). This variety of techniques has prompted researchers to evaluate their work differently. This section introduces the metrics used to evaluate user privacy (protection), as well as the metrics used to evaluate the performance of privacy attacks and theoretical evaluation approaches.
Article
Full-text available
We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. In brief, given a ‘sensitive’ search query, the objective of our work is to retrieve the target documents from a search engine without disclosing the actual query. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the private user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search and investigate the effectiveness of the proposed solution with a set of queries with privacy issues on a large web collection. Experiments show great improvements in retrieval effectiveness over a previously reported baseline in the literature. Furthermore, the methods are more versatile, predictably-behaved, applicable to a wider range of information needs, and the privacy they provide is more comprehensible to the end-user. Additionally, we investigate the perceived privacy via a user study, as well as, measure the system’s usefulness taking into account the trade off between retrieval effectiveness and privacy. The practical feasibility of the methods is demonstrated in a field experiment, scrambling queries against a popular web search engine. The findings may have implications for other IR research areas, such as query expansion, query decomposition, and distributed retrieval.
Conference Paper
Full-text available
Web search queries reveal extensive information about users' personal lives to search engines and Internet eavesdroppers. Obfuscating search queries by adding dummy queries is a practical and user-centric protection mechanism to hide users' search intentions and interests. Despite a few such obfuscation methods and tools, there is no generic quantitative methodology for evaluating users' web-search privacy. In this paper, we provide such a methodology. We formalize the adversary's background knowledge and attacks, the users' privacy objectives, and the algorithms to evaluate the effectiveness of query obfuscation mechanisms. We build upon machine-learning algorithms to learn the linkability between user queries. This encompasses the adversary's knowledge about the obfuscation mechanism and the users' web-search behavior. Then, we quantify the privacy of users with respect to linkage attacks. Our generic attack can run against users for whom the adversary does not have any background knowledge, as well as for cases where some prior queries from the target users have already been observed. We quantify privacy at the query level (the link between a user's queries) and the semantic level (the user's topics of interest). We design a generic tool that can be used for evaluating generic obfuscation mechanisms and users with different web search behavior. To illustrate our approach in practice, we analyze and compare the privacy of users under two example obfuscation mechanisms on a set of real web-search logs.
Conference Paper
Full-text available
Forming an accurate mental model of a user is crucial for the qualitative design and evaluation steps of many information-centric applications such as web search, content recommendation, or advertising. This process can often be time-consuming as search and interaction histories become verbose. In this work, we present and analyze the usefulness of concise human-readable user profiles in order to enhance system tuning and evaluation by means of user studies.
Article
In 2010, Lindell and Waisbard proposed a private web search scheme for malicious adversaries. At the end of the scheme, each party obtains one search word and queries the search engine with that word. We remark that a malicious party could query the search engine with a fake word instead of the word obtained. The malicious party can link the true word to its provider if the victim publicly complains about the false search result. To fix this drawback, each party has to broadcast all shares so as to enable every party to recover all search words and query the search engine with all these words. We also remark that, from a user's perspective, there is a very simple method to achieve the same purpose as private shuffle. When a user wants to privately query the search engine with a word, he can pick another n-1 padding words to form a group of n words and permute these words randomly. Finally, he queries the search engine with all these words.
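The simple client-side method described above (pick n-1 padding words, shuffle, submit all n) can be sketched directly. The padding pool below is a placeholder of my own; any corpus of plausible cover queries would serve.

```python
import random

# Sketch of the single-user alternative to private shuffle described
# above (assumption: PADDING_POOL stands in for a real cover corpus).
PADDING_POOL = ["weather today", "football scores", "pasta recipe",
                "train timetable", "news headlines"]

def cover_queries(real_query, n=4, pool=PADDING_POOL):
    batch = random.sample(pool, n - 1) + [real_query]
    random.shuffle(batch)      # the engine sees n equally likely queries
    return batch
```

From the search engine's view, each of the n submitted words is equally likely to be the real interest, mimicking what the group shuffle achieves without any peers.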
Article
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher's anonymity, but it was not much of a shield. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men" to "dog that urinates on everything." And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia." It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her. AOL removed the search data from its site over the weekend and apologized for its release, saying it was an unauthorized move by a team that had hoped it would benefit academic researchers. But the detailed records of searches conducted by Ms. Arnold and 657,000 other Americans, copies of which continue to circulate online, underscore how much people unintentionally reveal about themselves when they use search engines — and how risky it can be for companies like AOL, Google and Yahoo to compile such data. Those risks have long pitted privacy advocates against online marketers and other Internet companies seeking to profit from the Internet's unique ability to track the comings and goings of users, allowing for more focused and therefore more lucrative advertising. But the unintended consequences of all that data being compiled, stored and cross-linked are what Marc Rotenberg, the executive director of the Electronic Privacy Information Center, a privacy rights group in Washington, called "a ticking privacy time bomb." Mr. 
Rotenberg pointed to Google's own joust earlier this year with the Justice Department over a subpoena for some of its search data. The company successfully fended off the agency's demand in court, but several other search companies, including AOL, complied. The Justice Department sought the information to help it defend a challenge to a law that is meant to shield children from sexually explicit material.
Article
Republican politics, has an interest in the Bible and contributes to political and environmental causes. Mrs. Twombly's profile is part of RapLeaf's rich trove of data, garnered from a variety of sources and which both political parties have tapped. A company called RapLeaf is building databases on people by tapping voter-registration files, shopping histories, social-networking activities and real estate records. WSJ's Emily Steel and Julia Angwin join the Digits show to discuss which sites are using RapLeaf, and what web users can do to try to protect their privacy.
Conference Paper
Web search is an integral part of our daily lives. Recently, there has been a trend of personalization in Web search, where different users receive different results for the same search query. The increasing personalization is leading to concerns about Filter Bubble effects, where certain users are simply unable to access information that the search engines' algorithm decides is irrelevant. Despite these concerns, there has been little quantification of the extent of personalization in Web search today, or the user attributes that cause it. In light of this situation, we make three contributions. First, we develop a methodology for measuring personalization in Web search results. While conceptually simple, there are numerous details that our methodology must handle in order to accurately attribute differences in search results to personalization. Second, we apply our methodology to 200 users on Google Web Search; we find that, on average, 11.7% of results show differences due to personalization, but that this varies widely by search query and by result ranking. Third, we investigate the causes of personalization on Google Web Search. Surprisingly, we only find measurable personalization as a result of searching with a logged in account and the IP address of the searching user. Our results are a first step towards understanding the extent and effects of personalization on Web search engines today.
Article
The quality of results to Web search queries is substantially limited because of the cost and short processing times allowed at search engine's data center to retrieve relevant pages, augment ads, and present them to the end-user. We tackle such an issue by proposing a radically different system, where the search engine replies to a query with a large list of relevant URLs. Our client-side platform then proceeds to download the target HTML files only, parse them, understand their content, and present a summary to the user. Different types of summaries can be created by using plug-ins attached to the base platform. Each plug-in provides the required functionality according to the type of summary desired by the user. With this novel mechanism, our system offers increased computational power for post-processing search results and consequently improves and personalizes the user's search experience while maintaining constant workload at the search engine. We present prototype implementations of the proposed search assistant and an associated shopping plug-in capable of detecting whether a Web-page encapsulates a direct commercial offering. We also review measurements related to projected system performance and demonstrate its applicability to different scenarios.
Article
In this paper we introduce a system called Crowds for protecting users' anonymity on the world-wide-web. Crowds, named for the notion of “blending into a crowd,” operates by grouping users into a large and geographically diverse group (crowd) that collectively issues requests on behalf of its members. Web servers are unable to learn the true source of a request because it is equally likely to have originated from any member of the crowd, and even collaborating crowd members cannot distinguish the originator of a request from a member who is merely forwarding the request on behalf of another. We describe the design, implementation, security, performance, and scalability of our system. Our security analysis introduces degrees of anonymity as an important tool for describing and proving anonymity properties.
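The Crowds forwarding rule described above can be sketched in a few lines. The forwarding probability value is my assumption for illustration; the paper's analysis only requires it to be a fixed p_f.

```python
import random

# Toy sketch of Crowds-style routing (assumption: p_f = 0.75 chosen
# arbitrarily here): each member forwards the request to another random
# member with probability p_f, otherwise submits it to the web server.
def route(members, p_f=0.75):
    path = [random.choice(members)]            # initiator picks a random member
    while random.random() < p_f:
        path.append(random.choice(members))    # keep forwarding inside the crowd
    return path                                # last hop submits to the server
```

Because every member on the path is equally likely to be the originator, the server (and even a colluding member) cannot tell who issued the request, which is the "blending into a crowd" property.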