Poshida, A Protocol for Private Information Retrieval

Mohib Ullah, Rafiullah Khan
IBMS/CS, The University of Agriculture, Peshawar, Pakistan
mohibullah@aup.edu.pk, rafiyz@aup.edu.pk

Muhammad Arshad Islam
Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan
arshad.islam@cust.edu.pk
Abstract— Web Search Engines (WSEs) are the easiest way to retrieve data from the Internet: a WSE retrieves information from an ocean of data according to user-generated queries. In return, the WSE records those queries to build a user profile and return personalized search results. WSEs sell their query logs to marketing companies to generate revenue, which poses a threat to user privacy. In 2006, AOL released a query log for research purposes, but the log was not properly anonymized, which led to the identification of users. Performing private web search is an active area of research. Many techniques have been proposed to hide the identity of users from the WSE; however, a state-of-the-art scheme to privately retrieve information from a WSE has yet to be devised. This paper proposes a new protocol to obfuscate the user profile maintained by the WSE. Profile Exposure Level is used to measure the level of privacy a user achieves against the WSE. Results show the protocol hides 50% of the user profile at the first degree, 70% at the second, 77% at the third, and more than 85% at the fourth degree.
Keywords: Web Search Engine Privacy, Profile Exposure Level, Profile Obfuscation
I. INTRODUCTION
The Internet is an enormous warehouse of data that holds a wide range of material and contains information about almost anything. People of every category, class, and country need information residing on the WWW. Web Search Engines (WSEs) such as Google, Ask, Bing, AOL, and Baidu let us retrieve relevant information from the web through search queries. During the search and retrieval process, a WSE records all submitted queries in a query log. A typical query-log entry may contain the query content, the machine's IP address, operating-system details, browser type, the query's date and time, browser language, some important preferences, and cookies that can be used to uniquely identify the user's browser [1, 2]. WSEs state that they evaluate the query log with certain algorithms over a long period to profile and categorize users according to their interests. A typical query is about three words long, which reveals very little about the actual interest of the user, and certain words can be ambiguous: for example, the keyword "mouse" may refer to an animal or to a computer device. WSEs use the user's profile to show relevant results [2, 3].
The query log is a precious resource to the WSEs, which often sell it to marketing companies for business purposes [4]. Query logs frequently hold sensitive evidence about people, and the dissemination of such data violates their privacy [5]. On several occasions, user queries have contained important information such as a unique user ID, name, employer details, location, religion, health information, gender orientation, political views, faith, and beliefs, which can be exceptionally sensitive for the owner [6]. The release of such information poses a serious risk to user privacy. Among many risks, a key one is disclosure to third parties (e.g., advertisers or media) [5, 22] for business purposes or for gathering information about competitors' products. The major privacy scandal was the release of the AOL log in 2006, in which twenty million queries generated by 658,000 users over three months were published for research purposes. Before the release, the query log was anonymized by replacing IP addresses with unique IDs (pseudonyms); nevertheless, a user named Thelma Arnold, with ID 4417749, was successfully identified [8, 9] because the log was not appropriately anonymized. In another episode, the US Department of Justice issued a summons [2] to AOL, Yahoo, Google, and Microsoft to provide query logs, as part of litigation over an Internet child-safety law, in order to determine whether Internet filters were effectively protecting children from adult content [2]. These incidents put a question mark over WSEs' policies on user privacy, and many users have demanded that WSEs should not maintain query logs at all [11].
978-1-5090-2000-3/16/$31.00 ©2016 IEEE
A. Related work:
So far, many techniques have been proposed to obfuscate the user profile. These techniques are generally classified into four categories: standalone schemes, third-party infrastructure, query scrambling, and distributed schemes. Standalone schemes such as TrackMeNot and GooPIR protect the privacy of a lone user. TrackMeNot is a Firefox plugin that programmatically creates a query seed file and sends noise queries along with the original query; this hides the user's original query among the noise queries and thus obfuscates his profile. GooPIR masks the original query with k-1 false queries. In both cases, the machine-generated queries can be distinguished from real human queries. Third-party infrastructures such as Scroogle1 and Anonymizer2 are proxy services in which the user forwards his queries to a proxy server, which forwards them to the WSE. The problem stays the same: profiling can now be done at the proxy server. Tor (The Onion Router) is a group of volunteer-operated network servers that provide privacy [10]. Tor provides anonymity at the network layer, but it was not specifically designed for web anonymity, and a WSE can still identify a Tor user at the application layer. Query scrambling is a novel technique [8] proposed to obfuscate the user profile. It never tries to hide the identity of the user; instead, the user's query is scrambled so that it only loosely corresponds to the user's genuine interest, thus distorting the user profile. The real query is divided into multiple scrambled queries, which are submitted to the WSE; the results are collected, and a scrambled ranking is applied to the result list to recover the user's actual interest, a step called descrambling [8]. Distributed schemes work through the cooperation of multiple users to diffuse the identities of the users in the group. Each user forwards someone else's query and in return has his own query forwarded by another user, thus obfuscating the profile with real human queries.
Private Information Retrieval (PIR) [12] was the first technique proposed to retrieve an element from a database without letting the database learn which item the user is interested in. Crowds [13] was the first distributed scheme introduced to attain anonymity. Crowds consisted of client-side software called a jondo and server-side software called a blender. To make a web transaction, a user first joins the crowd; a biased coin is then flipped to determine whether to send the request directly to the WSE or to forward it to another member of the crowd. The reply from the WSE comes back through the same path. Crowds achieved some degree of anonymity against the WSE but offered no privacy against local eavesdropping, and both the query and the answer remain visible to every member of the crowd on the path between the originating user and the WSE.

1 http://scroogle.org
2 http://www.anonymizer.com
User-Private Information Retrieval (UPIR) considered a community of users sharing common memory locations for writing queries, reading queries, and recording the WSE's answers [14]. Three flavors of UPIR (one-to-one, all-to-all, and configuration-based protocols) were introduced over time; each had weaknesses and was unable to provide privacy. An optimal configuration of the peer-to-peer network using a combinatorial (v, b, r, k)-configuration design was presented to achieve privacy [15], but [18] managed to compromise that technique with an intersection attack and algebraic functions. The Useless User Profile (UUP) protocol [17] is yet another distributed concept, involving the users, a central node, and the WSE. The authors of [7] showed that UUP is insecure in the presence of even a single malicious user: they compromised UUP's privacy with four types of attack and then modified UUP with double encryption, achieving higher privacy at the cost of considerable delay. However, [19] attacked [7] and concluded that the privacy of that technique can be compromised at the verification stage.
The use of existing social networks for private web search was investigated by [21, 24]. Those approaches achieved privacy from the WSE, but queries remained visible among the social-network peers: since each user forwards an unencrypted query, multiple collaborating users could find the query initiator through a predecessor attack [20]. The authors of [22] updated the UUP concept and made it secure in the presence of untrusted partners by using an Optimized Arbitrary Size (OAS) Benes network and ElGamal group-key encryption; queries are shuffled so that no one can link any query to its user. PEP and DISPEP, zero-knowledge-proof techniques, are used for proving privacy. The authors of [22] used a group size of three; group sizes of four and five users introduce significant delay due to the complex shuffling. They achieved considerable privacy on the local side, but the results are broadcast in clear text, so everyone knows what is being searched in a group, and the WSE can still manage to find the user through an intersection attack.
B. Ethical Issues:
In any distributed-system protocol, each user forwards other users' queries. This is an automated process; users are not required to check queries manually. In the worst case, a user may forward a dangerous query of some other user, and an innocent user may be caught over a query he never made. Such situations could be avoided if each query were filtered before being forwarded to the WSE; however, filtering each query and classifying it as innocent or dangerous is another field of research and beyond the scope of this paper. The authors of [24] raised the same issue and provided a liability mechanism to track down the originator of a dangerous query, but such a mechanism puts an extra burden on the system and degrades its performance. A comprehensive solution to this problem will be required in the future. This work does not primarily consider ethical issues; it is assumed that all users make benign queries and no one forwards a dangerous query.
C. Plan of this paper:
This work proposes a new protocol, "Poshida, A Protocol for Private Information Retrieval". Poshida aims to enhance the privacy of a user, both locally and against the WSE, by obfuscating the user profile. Poshida's first goal is to privately retrieve information from the WSE while concealing the user's identity; the second is to evaluate the privacy a user preserves against the peer nodes, the central server, and the WSE. Poshida is initially simulated with real data extracted from the AOL query-log dataset for 100 users. The privacy of a user is evaluated with the Profile Exposure Level (PEL) metric suggested by [21, 22, 24], which estimates the privacy level a user achieves against the WSE while executing the proposed protocol.

Section II describes the background and notation, Section III explains the system model, Section IV gives a Poshida overview, Section V analyzes the privacy of the proposed protocol, Section VI presents results and discussion, and Section VII concludes with future work.
II. BACKGROUND AND NOTATIONS
A. RSA Algorithm:
RSA is an asymmetric cryptographic algorithm, introduced in 1977, that uses two different keys for encryption and decryption [25]. RSA performs its cryptographic operation in three steps.

Key Generation: the following steps generate the public and private keys.
1. Choose two random large prime numbers x, y of similar length.
2. Calculate n = x * y.
3. Compute Euler's phi function Ø(n) = (x-1) * (y-1).
4. Pick an integer e such that 1 < e < Ø(n) and GCD(e, Ø(n)) = 1.
5. Public key: the pair (n, e) forms the RSA public key and is made public.
6. Private key: d is computed from x, y, and e, where d is the multiplicative inverse of e mod Ø(n), i.e. d = e^-1 (mod Ø(n)); the private key (d, n) is kept private.
Encryption: to encrypt a plaintext P, it is first represented as a series of numbers using an agreed-upon reversible protocol known as a padding scheme. Encryption is then the mathematical step of eq. (1):

C = P^e mod n    (1)
Decryption: RSA decryption is simple: the ciphertext C is raised to the power of the private key d modulo n to recover the plaintext:

P = C^d mod n    (2)
B. Random Selection Method:
Each user downloads a list of all connected users from the CS. During the query-shuffling process, the query-originating user selects another user uniformly at random from the list and forwards his query to him. The probability of selecting any given user is the same, i.e. 1/n, where n is the total number of users.
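A minimal sketch of this uniform selection, assuming the user list downloaded from the CS is a list of identifiers and that the selecting user excludes itself:

```python
import random

def pick_peer(connected_users, me):
    """Pick a forwarding peer uniformly at random from the list
    downloaded from the CS, excluding the selecting user itself."""
    candidates = [u for u in connected_users if u != me]
    return random.choice(candidates)  # each peer chosen with equal probability

users = ["U%d" % i for i in range(1, 11)]
peer = pick_peer(users, "U1")
```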
C. Privacy Evaluation:
The authors of [21, 24] introduced the term Profile Exposure Level (PEL) to measure the privacy a user achieves against the WSE. PEL uses mutual information and entropy to measure the level of user-profile exposure:

PEL = I(M, N) / H(M) * 100    (3)

where M represents the set of categories of the queries the user actually generates and N represents the set of categories of the queries the user sends to the WSE; N also contains the categories of other users' queries. H(M) is the entropy of M:

H(M) = -Σ_m p(m) log2 p(m)    (4)

I(M, N) is the mutual information:

I(M, N) = H(M) - H(M|N)    (5)

I(M, N) = Σ_{m,n} p(m, n) log2( p(m, n) / (p(m) p(n)) )    (6)

H(M|N) is the conditional entropy; p(m) and p(n) are the probabilities of each element of M and N, proportional to its count. In this work, PEL is used to measure the percentage of information exposed when a user forwards other users' queries.
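A small sketch of the PEL computation in eqs. (3)-(6), estimating the distributions from paired (real category, sent category) observations; the example data below are hypothetical, not drawn from the AOL experiment.

```python
from collections import Counter
from math import log2

def pel(pairs):
    """Profile Exposure Level, eq. (3): PEL = I(M, N) / H(M) * 100.
    `pairs` holds (real_category, sent_category) observations from
    which the marginal and joint probabilities are estimated."""
    total = len(pairs)
    joint = Counter(pairs)                     # p(m, n)
    pm = Counter(m for m, _ in pairs)          # p(m)
    pn = Counter(n for _, n in pairs)          # p(n)
    h_m = -sum((c / total) * log2(c / total)   # eq. (4)
               for c in pm.values())
    i_mn = sum((c / total) *                   # eq. (6)
               log2((c / total) / ((pm[m] / total) * (pn[n] / total)))
               for (m, n), c in joint.items())
    return 100.0 * i_mn / h_m

# If the sent categories mirror the real ones exactly, the profile
# is fully exposed and PEL is 100.
exposed = pel([("Arts", "Arts"), ("Sports", "Sports"),
               ("Arts", "Arts"), ("News", "News")])
```

A lower PEL means that observing the forwarded queries reveals less about the user's real profile.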
III. SYSTEM MODEL
This section defines the entities that participate in Poshida. The Poshida architecture has the following entities:
i. User
ii. Central Server
iii. Query Forwarding Node
iv. Web Search Engine
A. User:
The person who actually makes a web search query. We have honest, curious, and dishonest users. An honest user's aim is to protect his or her privacy, whereas curious and dishonest users are motivated to find the queries of honest users.
B. Central Server (CS):
A server that supervises users' connections to the system and provides each user with the list of all connected users. The CS performs the selection and announcement of the Query Forwarding Node (QFN). When a user (node) connects to the CS, the CS records the user's IP address and port number and distributes this information to all connected users.
C. Query Forwarding Node (QFN):
A user responsible for forwarding queries to the WSE, downloading the results from the WSE, encrypting the results, and broadcasting the encrypted results to all members of the group.
D. Web Search Engine:
WSEs such as Google, Bing, Yahoo, and AOL are software systems that retrieve data from the World Wide Web. When a WSE receives a query, it records user details such as IP address, OS details, browser type, language, and preferences, and builds the user's profile so that it can return personalized results. The WSE has no intent to protect the privacy of the user.
IV. POSHIDA OVERVIEW
The objective of Poshida is to obfuscate the profiles of users submitting queries to the WSE. Initially, all users who want to submit queries to the WSE are required to connect to the CS. The CS selects one user as QFN from the list of connected users, from top to bottom on a first-come-first-served basis, for a session. During that session the QFN is expected to forward the queries of all users at least once, after which the QFN is changed. Poshida's execution is explained below.

Poshida Description:
Poshida is executed in five sequential steps. The protocol is implemented as a Java applet program consisting of client-side and server-side software.
1) Connection setup:
The client-side software lets a user U connect to the CS, which records the user's details (IP address and port number). The CS provides the list of connected users, i.e. {U1, U2, ..., Un}, to all other users.
2) Query Forwarding Node (QFN):
The CS selects each user as QFN, one by one on a first-come-first-served basis, for a session, and informs that user that he has been selected as QFN. The QFN generates his public key and gives it to the CS, which announces the QFN's IP address, port number, and public key. The QFN's details are made available to every client.
3) Query sending process:
When a node wants to send a query, it encrypts the query with the public key of the QFN and attaches its own public key for later use. The user then generates a random number X between 1 and 100; X is appended to the encrypted query packet and remains visible.
4) Query shuffling process:
The user randomly selects a peer from the list of all connected users (each peer with equal probability) and forwards the encrypted query packet to him. Upon receiving the packet, the peer checks whether the appended X is even or odd. If X is even, the peer forwards the packet to the QFN; if X is odd, the peer generates another random number, replaces the previous X, and forwards the packet to another randomly selected peer. The process repeats until X is found to be even, so after a few passes the encrypted query packet reaches the QFN.
5) Query forwarding to the Web Search Engine:
When the packet arrives at the QFN, the QFN decrypts it with its private key and sets aside the public key of the query-originating user. The QFN sends the query to the WSE, downloads the results retrieved by the WSE, encrypts them with the public key of the requesting user, and broadcasts the encrypted result to all users. Only the intended user, who holds the matching private key, is able to decrypt the result. The process is shown in Fig. 1.
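The even/odd walk in steps 3 and 4 can be simulated to see how many peer-to-peer passes a packet typically takes before reaching the QFN. This sketch abstracts away the encryption and treats each draw of X as an independent draw from 1..100, which is an assumption for illustration:

```python
import random

def passes_to_qfn(rng):
    """Count peer-to-peer passes of one query packet: the current
    holder draws X in 1..100; an even X sends the packet to the QFN,
    an odd X replaces X and forwards to another random peer."""
    passes = 1                           # originator always forwards once
    while rng.randint(1, 100) % 2 == 1:  # odd X: pass to another peer
        passes += 1
    return passes

rng = random.Random(42)
trials = [passes_to_qfn(rng) for _ in range(10_000)]
avg = sum(trials) / len(trials)
# P(even X) = 1/2, so the pass count is geometric with mean 2.
```

Under this model the expected number of passes is 2 and long walks are exponentially unlikely, although there is no hard upper bound on path length.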
Key generation and encryption:
A KeyPairGenerator class instance is used to generate the key pairs, with a key length of 1024 bits. Each user generates his own key pair and is responsible for generating a new pair every time he connects to the system.
V. PRIVACY OF PROPOSED PROTOCOL
The privacy of a user running Poshida is investigated against the following dishonest entities:

A. Against peer users:
Peer users cannot see the user's query because it is encrypted with the public key of the QFN. When a user receives a query packet with a random number, he cannot link it to a user, because the receiver cannot be certain whether the node that forwarded the packet is the query originator or merely a forwarder. The query is encrypted under the public key of the QFN, so no one but the QFN can decrypt it. When the QFN downloads the result for the query, it encrypts the result with the public key of the querying user before broadcasting it; hence, no one but the query originator can see the results. Thus both the query and the results remain hidden from the rest of the peers. A dishonest user can only link a query to a user if the other n-1 users in the query-sending process are compromised.
B. Against the Central Server:
The CS is a dedicated computer that only gives out information about the QFN and the details of all connected users; it takes no part in query forwarding or query shuffling. Even if the CS and the QFN are compromised, malicious, or curious nodes, they cannot link any query to a user, because at the shuffling stage the user selects a peer uniformly at random from the list of available users. The probability of linking a query to a user is therefore 1/n, where n is the number of honest users. In this protocol all users are placed in a single group.
C. Against the QFN:
The QFN can see the query, but the query remains unlinkable to the user: the QFN cannot be certain who actually made it.
D. Against the WSE:
The WSE receives the queries from the QFN and considers the QFN to be the query originator, although it is in fact only a forwarding user. The QFN forwards the queries of all online users, with their miscellany of interests; hence, after executing Poshida, the users' profiles are highly obfuscated. In previous systems [17, 21, 22, 24], each user forwarded other users' queries to the WSE but never his own, so the WSE could be certain that queries forwarded by a user were not his own, which increased the risk of a privacy breach. In Poshida, however, the QFN may forward his own query along with all the other nodes' queries, which increases privacy.

To evaluate the level of privacy a user achieves by executing Poshida, a test is performed to see how much of the profile is obfuscated. The test compares the case where users run Poshida and submit queries through the QFN with the case where they submit their queries to the WSE directly. Let P represent the user's original profile without the protocol, and P′ the obfuscated profile, i.e. the profile observed when the user uses the protocol. The test is performed on the AOL query dataset [23] to find the difference between P and P′. The first step is to extract a user's profile from the AOL queries. The authors of [22] proposed two steps for the query-categorization task: morpho-syntactic analysis and semantic analysis of the queries. In the first step, natural language processing (NLP) techniques (sentence detection, tokenization, part-of-speech tagging, syntactic parsing, stop-word removal, and stemming) are applied to obtain the main topic of the query, as detailed in [27]. The terms obtained in this step are then sent to the ODP (dmoz.org) to get the category of the query topic. The ODP is the largest human-edited web directory, maintained by a community of volunteers [28]. When queries are sent to dmoz.org [28], it classifies each query into a category, as shown in Table 1. Consider the query "James Bond": it is found in the ODP directory at "Arts: Movies: Titles: J: James Bond Series: Fan Pages", so at the first degree it is "Arts", at the second "Movies", and so on; the user profile then contains Arts, Movies, James Bond, etc. Applying these two techniques, the user's queries are classified into categories and the user profile is extracted. We then have the categories for the user's original profile P and the obfuscated profile P′. To measure the difference between P and P′, the Profile Exposure Level (PEL) given in (3) is used; the PEL calculation is detailed in [21, 22, 24].
TABLE 1. EXAMPLE OF QUERY CLASSIFICATION BY ODP

Query         ODP classification at different degrees
ICC           Sports: Cricket: ICC: Events: World Cup
Harry Potter  Kids and Teens: Entertainment: Movies: Titles: Harry Potter
Qatar         Regional: Middle East: Qatar: Government
James Bond    Arts: Movies: Titles: J: James Bond Series: Fan Pages
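Splitting an ODP path into its degrees, as in Table 1, is a simple string operation. The following minimal helper is ours, for illustration; the paper does not prescribe an implementation:

```python
def odp_degrees(path, max_degree=4):
    """Split an ODP classification path such as
    'Arts: Movies: Titles: J: James Bond Series: Fan Pages'
    into its first `max_degree` levels."""
    levels = [part.strip() for part in path.split(":") if part.strip()]
    return levels[:max_degree]

bond = odp_degrees("Arts: Movies: Titles: J: James Bond Series: Fan Pages")
icc = odp_degrees("Sports: Cricket: ICC: Events: World Cup:", max_degree=2)
```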
E. Privacy parameter selection:
The authors of [22] argued that the privacy of a user increases with the size of the group. Poshida puts all users into a single common group, so user profiles are significantly obfuscated, achieving strong privacy. The test is performed with the real AOL query set: a subset of 100 users with their queries is randomly selected from the AOL dataset. In the experiments, first 10 users join the CS and execute Poshida, then 20 users, and finally 30 users. To cover the 100 users, Poshida is iterated 10 times when 10 users join the server, five times for 20 users, and three times for 30 users. The resulting artificial query log is captured, the profile each user achieves is calculated using the steps of the previous section, and the simulated profile is compared with the original profile using Eq. (3) to find the percentage of obfuscation.
Fig. 1. Step-by-step description of the protocol.
VI. RESULTS AND DISCUSSION
The results obtained are summarized in Table 2, which shows the average Profile Exposure Level for different ODP degrees. These percentages denote how much of the real profile is disclosed by observing the obfuscated profile. The ODP category at the second degree provides a sufficient and consistent level of specificity [28] for evaluating the profile; however, we tested four degrees to get a more detailed view of privacy. Degree 1 represents the most general category of a query topic; as we move to the second, third, and fourth degrees, the query topic becomes increasingly specific and Poshida achieves higher privacy. At degree 2, more than 70% of the profile remains hidden; at degree 3, this rises to more than 80% for 20 users simulating together and about 77% for 10 and 30 users; and at degree 4, more than 85% of the profile remains hidden.
VII. CONCLUSION AND FUTURE WORK
A WSE builds a user profile to provide personalized search results over the queries it receives from the user. However, web queries sometimes contain sensitive information that threatens the user's privacy. This work focused on that privacy problem, and a novel protocol, Poshida, was proposed to achieve privacy. Poshida was initially simulated for 100 users in a single group, with 10, 20, and 30 users running the protocol; the results are quite encouraging.
Fig. 2. Profile Exposure Level at different degrees.
TABLE 2: AVERAGE PEL AT DIFFERENT DEGREES

Users   Degree 1   Degree 2   Degree 3   Degree 4
10      40.09      27.50      23.04      13.61
20      53.74      27.18      19.39      12.98
30      55.06      28.66      22.85      14.33
In the future, we will evaluate Poshida's performance in terms of the delay it introduces in query answering. The query-shuffling phase is based on finding an even number; a better technique will be required to bound the maximum number of passes a packet may take before reaching the QFN. Poshida will also be simulated with higher numbers of users, i.e. 1000 and 2000 users, to check its performance and scalability.
References
[1] Renan Cattelan , Darko Kirovski, Towards
improving the online shopping experience: A
client-based platform for post-processing Web
search results, Web Intelligence and Agent
Systems, v.10 n.2, p.209-231, April 2012
[2] Cooper, Alissa. "A survey of query log privacy-
enhancing techniques from a policy perspective."
ACM Transactions on the Web (TWEB) 2.4
(2008): 19.
[3] Hannak, Aniko, et al. "Measuring
personalization of web search. "Proceedings of
the 22nd international conference on World Wide
Web. International World Wide Web
Conferences Steering Committee, 2013.
[4] E. Steel, A web pioneer profiles users by name,
Wall Street J (2010).
[5] Fung, Benjamin, et al. "Privacy-preserving data
publishing: A survey of recent
developments." ACM Computing Surveys
(CSUR) 42.4 (2010): 14.
[6] Saint-Jean, Felipe, et al. "Private web search."
Proceedings of the 2007 ACM workshop on
Privacy in electronic society. ACM, 2007.
[7] Lindell, Yehuda, and Erez Waisbard. "Private
web search with malicious adversaries." Privacy
Enhancing Technologies. Springer Berlin
Heidelberg, 2010.
[8] Arampatzis, Avi, George Drosatos, and Pavlos S.
Efraimidis. "Versatile Query Scrambling for
Private Web Search." Information Retrieval
Journal 18.4 (2015): 331-358.
[9] Barbaro, Michael, Tom Zeller, and Saul Hansell. "A face is exposed for AOL searcher no. 4417749." New York Times, 9 August 2006.
[10] R. Dingledine, N. Mathewson, P. Syverson, Tor: the second-generation onion router, in: Proceedings of the 13th Conference on USENIX Security Symposium, 2004, pp. 21-31.
[11] Purcell, Kristin, Joanna Brenner, and Lee Rainie.
"Search engine use 2012." (2012).
[12] B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan, Private information retrieval, Journal of the ACM 45 (1998) 965-981.
[13] Reiter, M. K., & Rubin, A. D. Crowds:
Anonymity for web transactions. ACM
Transactions on Information and System Security
(TISSEC) (1998), 1(1), 66-92.
[14] Domingo-Ferrer, Josep, et al. "User-private
information retrieval based on a peer-to-peer
community." Data & Knowledge Engineering
68.11 (2009): 1237-1252.
[15] Stokes, Klara, and Maria Bras-Amoros. "Optimal
configurations for peer-to-peer user-private
information retrieval." Computers & mathematics
with applications 59.4 (2010): 1568-1577.
[16] Swanson, Colleen M., and Douglas R. Stinson.
"Extended combinatorial constructions for peer-
to-peer user-private information retrieval." arXiv
preprint arXiv:1112.2762 (2011).
[17] Castellà-Roca, Jordi, Alexandre Viejo, and Jordi
Herrera-Joancomartí. "Preserving user’s privacy
in web search engines." Computer
Communications 32.13 (2009): 1541-1551
[18] Swanson, Colleen M., and Douglas R. Stinson.
"Extended combinatorial constructions for peer-
to-peer user-private information retrieval." arXiv
preprint arXiv:1112.2762 (2011).
[19] Cao, Zhengjun, Lihua Liu, and Zhenzhen Yan. "An Improved Lindell-Waisbard Private Web Search Scheme." International Journal of Network Security 18.3: 538-543.
[20] Wright, Matthew K., et al. "The predecessor
attack: An analysis of a threat to anonymous
communications systems." ACM Transactions on
Information and System Security (TISSEC) 7.4
(2004): 489-522.
[21] Erola, Arnau, et al. "Exploiting social networks to
provide privacy in personalized web search."
Journal of Systems and Software 84.10 (2011):
1734-1745.
[22] Romero-Tris, Cristina, Jordi Castella-Roca, and Alexandre Viejo. "Distributed system for private web search with untrusted partners." Computer Networks 67 (2014): 26-42.
[23] ElGamal, Taher. "A public key cryptosystem and
a signature scheme based on discrete logarithms."
Advances in cryptology. Springer Berlin
Heidelberg, 1985.
[24] Viejo, Alexandre, and Jordi Castellà-Roca.
"Using social networks to distort users’ profiles
generated by web search engines." Computer
Networks 54.9 (2010): 1343-1357.
[25] Rivest, R. L., Shamir, A., & Adleman, L. (1978).
A method for obtaining digital signatures and
public-key cryptosystems. Communications of
the ACM, 21(2), 120-126.
[26] Gervais, Arthur, et al. "Quantifying web-search
privacy." Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications
Security. ACM, 2014.
[27] C.D. Manning, H. Schütze, Foundations of
Statistical Natural Language Processing, MIT
Press, Cambridge, MA, USA, 1999.
[28] ODP, Open Directory Project, 2013.
<http://www.dmoz.org/>.
[29] C. Eickhoff, K. Collins-Thompson, P. Bennett, S. Dumais, Designing human-readable user profiles for search evaluation, in: Proceedings of the 35th European Conference on Advances in Information Retrieval, ECIR'13, Springer-Verlag, Berlin, Heidelberg, 2013, pp. 701-705.
... These techniques are proposed to tackle the user privacy infringement problem . There are numerous techniques available to counter privacy infringement, such as proxy networks (Berthold, Federrath, & Köpsell, 2001;Mokhtar et al., 2017), profile obfuscation techniques (Nissenbaum & Daniel, 2009), query scrambling techniques (Arampatzis, Drosatos, & Efraimidis, 2015;Arampatzis, Efraimidis, & Drosatos, 2013), private information retrieval protocols (Reiter & Rubin, 1998;Romero-Tris, Castella-Roca, & Viejo, 2011;Romero-Tris, Viejo, & Castellà-Roca, 2015;Ullah et al., 2019;Ullah et al., 2021;Ullah, Khan, & Islam, 2016a, 2016bUllah et al., 2022;Viejo, Castella-Roca, Bernadó, & Mateo-Sanz, 2012) and others (Chen, Bai, Shou, Chen, & Gao, 2011;Mokhtar, Berthou, Diarra, Quéma, & Shoker, 2013;Mokhtar et al., 2017;Petit, Cerqueus, Mokhtar, Brunie, & Kosch, 2015;Shapira, Elovici, Meshiach, & Kuflik, 2005). ...
... Here p represents a discrete random variable for the user of interest (UoI), and the probability of each keyword used by the UoI is represented as kj. This measure is used by PaOSLo (Ullah et al., 2022), Poshida I and II (Ullah et al., 2016a, 2016b), MG-OSLo, OSLo, OQF-PIR (Rebollo-Monedero, Forne, & Domingo-Ferrer, 2012), and Balsa's model (Balsa, Troncoso, & Diaz, 2012). ...
... The original user profile comprises the user's original queries, while the obfuscated profile is created through a privacy-preserving web search mechanism. This metric is used by many prevalent private information retrieval protocols such as PaOSLo (Ullah et al., 2022), MG-OSLo, OSLo, UUP (Juarez & Torra, 2015), Poshida-I (Ullah et al., 2016b), Poshida-II (Ullah et al., 2016a) and others. The mathematical formula of the Profile Exposure Level is shown in Equation 6. ...
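Equation 6 itself is not reproduced in this excerpt, so the following is only an illustrative proxy for a PEL-style comparison of an original and an obfuscated profile. The assumptions are mine: a profile is a keyword-to-probability map, and exposure is proxied by the share of the original profile's entropy carried by keywords that also appear in the profile the WSE observes.

```python
import math

# Hypothetical PEL-style sketch (assumption: exposure = fraction of the
# original profile's Shannon entropy recoverable from shared keywords).
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def profile_exposure_level(original, observed):
    shared = {k: p for k, p in original.items() if k in observed}
    h = entropy(original)
    return 100.0 * entropy(shared) / h if h else 0.0
```

With `original = {"a": 0.5, "b": 0.25, "c": 0.25}` and an observed profile containing only keyword `"a"`, this proxy reports roughly 33% exposure.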
Chapter
Due to the exponential growth of information on the internet, web search engines (WSEs) have become indispensable for effectively retrieving information. Web search engines store users' profiles to provide the most relevant results. However, user profiles may contain sensitive information, including the user's age, gender, health condition, personal interests, and religious or political affiliation. This raises serious concerns about user privacy, since a user's identity may be exposed and misused by third parties. Researchers have proposed several techniques to address the issue of privacy infringement while using a WSE, such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols. In this chapter, the authors give a brief survey of privacy attacks and the evaluation models used to assess the performance of private web search techniques.
... These techniques include query scrambling [6], profile obfuscation [7], proxy services [8], and Private Information Retrieval (PIR) protocols [9-12]. In the query scrambling technique, the user's query is transformed into a set of diverse sub-queries that are later posted to the WSE, while in the profile obfuscation technique, the user's query is posted to the WSE together with fake queries. ...
... In the proxy-based approach, the user submits his/her query to the WSE through a proxy server, whereas in PIR protocols a group of users submit queries on behalf of each other to hide their identities. According to the literature, PIR protocols provide better privacy to WSE users than other techniques [1-13]. Some studies indicate that PIR protocols are vulnerable to machine learning attacks [13, 14], especially the QuPiD Attack [1, 3]. ...
... Moreover, we used the Topic Score feature vector for training and testing of the QuPiD Attack. The Topic Score feature vector comprises a set of numeric values for 10 major topics acquired from the uClassify service [1-13]. According to the results, the QuPiD Attack associated 40% of anonymized queries with the correct user with 70% precision. ...
Article
Full-text available
Web search engines usually keep users' profiles for multiple purposes, such as result ranking and relevancy, market research, and targeted advertisements. However, a user's web search history may contain sensitive and private information, such as health condition, personal interests, and affiliations, which may infringe on the user's privacy since the user's identity may be exposed and misused by third parties. Numerous techniques are available to address privacy infringement, including Private Information Retrieval (PIR) protocols that use peer nodes to preserve privacy. Previously, we proved that PIR protocols are vulnerable to the QuPiD Attack. In this research, we propose the NN-QuPiD Attack, an improved version of the QuPiD Attack that uses a Recurrent Neural Network (RNN)-based model to associate queries with their original users. The results show that the NN-QuPiD Attack achieved a recall of 0.512 with a precision of 0.923, whereas the simple QuPiD Attack achieved a recall of 0.49 with a precision of 0.934 on the same data.
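The precision and recall figures above score how well an attack attributes queries to users. A minimal sketch of that scoring, under my own assumption that the attack outputs a query-to-predicted-user map (with `None` meaning "no attribution"), could look like:

```python
# Hypothetical scorer for a query-attribution attack.
# predictions: {query: predicted_user or None}; truth: {query: actual_user}.
def precision_recall(predictions, truth):
    attributed = {q: u for q, u in predictions.items() if u is not None}
    correct = sum(1 for q, u in attributed.items() if truth.get(q) == u)
    precision = correct / len(attributed) if attributed else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall
```

Precision is computed only over the queries the attack dared to attribute, while recall is over the whole ground truth, which is why an attack can trade one off against the other.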
... However, the WSE can easily recognize a TOR member through device fingerprinting; moreover, TOR users are vulnerable to global eavesdropping and to active and passive adversaries [21]. Another approach is distributed schemes, which function through the collaboration of numerous users, where each user forwards other users' queries to hide the identity and obfuscate the profile of a user [9, 13, 22-28]. Figure 1 shows the schemes proposed in private web search. The privacy of a user in a private web search is considered preserved if (i) the content of the user's query and the result returned by the WSE remain hidden from the group peers, (ii) the query content cannot be linked to its originator, and (iii) the WSE is unable to build an accurate profile of the user. ...
... Authors in Refs. [9, 22, 23] have used the RSA and ElGamal shared-key encryption schemes. In this work, we use the RSA encryption algorithm to achieve confidentiality; RSA is an asymmetric encryption scheme, so its public key is easy to share. ...
... Authors in Refs. [9, 22, 23, 27] used PEL as a privacy evaluation metric to evaluate the privacy achieved by the user against the WSE. The PEL applies mutual information and entropy to compute the degree of user profile exposure. ...
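The point about RSA's easily shared public key can be illustrated with a toy "textbook" RSA round trip. This is a deliberately insecure sketch (tiny fixed primes, no padding) meant only to show that encryption needs the public pair (e, n) while decryption needs the private d.

```python
# Toy textbook RSA (NOT secure: tiny primes, no padding) to illustrate
# why asymmetric keys are easy to distribute: only (e, n) is published.

def egcd(a, b):
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def modinv(a, m):
    g, x, _ = egcd(a, m)
    assert g == 1, "a must be coprime with m"
    return x % m

def make_keys(p=61, q=53, e=17):
    n = p * q
    phi = (p - 1) * (q - 1)
    d = modinv(e, phi)
    return (e, n), (d, n)   # (public key, private key)

def encrypt(m, pub):
    e, n = pub
    return pow(m, e, n)

def decrypt(c, priv):
    d, n = priv
    return pow(c, d, n)
```

In a group protocol, each peer would publish its (e, n) so others can encrypt queries for it, while d never leaves the peer.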
Article
Full-text available
Users around the world send queries to Web Search Engines (WSEs) to retrieve data from the Internet. Users usually seek primary assistance concerning medical information from a WSE via search queries. Search queries relating to diseases and treatments are considered among the most personal facts about a user. Such queries often contain identifiable information that can be linked back to the originator, which can compromise a user's privacy. In this work, we propose a distributed privacy-preserving protocol (OSLo) that eliminates limitations in the existing distributed privacy-preserving protocols, along with a framework that evaluates the privacy of a user. The OSLo framework assesses local privacy relative to the group of users involved in forwarding queries to the WSE, and profile privacy against profiling by the WSE. The privacy analysis shows that the local privacy of a user depends directly on the size of the group and inversely on the number of compromised users. We have performed experiments to evaluate the profile privacy of a user using the privacy metric Profile Exposure Level. OSLo is simulated with a subset of 1000 users of the AOL query log. The results show that OSLo performs better than the benchmark privacy-preserving protocol in terms of privacy and delay. Additionally, the results show that the privacy of a user depends on the size of the group.
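The stated relationship between local privacy, group size, and compromised users can be illustrated with a toy model. The 1/(n - c) form below is my own simplifying assumption for illustration (a query forwarded at random is equally likely to come from any non-colluding member), not the paper's exact analysis.

```python
# Toy linkability model (assumption: with n group members of which c
# collude, a forwarded query is pinned to its originator with
# probability 1 / (n - c) from the colluders' point of view).
def link_probability(group_size, compromised):
    honest = group_size - compromised
    if honest <= 0:
        return 1.0      # everyone else colludes: the originator is exposed
    return 1.0 / honest
```

Under this toy model, growing the group from 5 to 10 honest members halves the linkability, matching the qualitative claim that privacy improves with group size.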
... In the profile obfuscation technique [13-21], fake queries are forwarded with the user's query in order to mislead the WSE. In PIR protocols [22-31], a group of users exchange their queries with each other and submit them to the WSE, while hybrid techniques [32, 33] combine more than one of the aforementioned techniques. ...
... Similarly, in order to avoid the problem of adverse users in the group, Viejo et al. [42] suggested involving a user's social media friends in group creation. Ullah et al. proposed Poshida [26] (single-group) and Poshida II [27] (multi-group) PIR protocols, which use the concept of a query forwarding node (QFN) to improve privacy. ...
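The query-exchange step common to these PIR protocols can be sketched as a random derangement: every member ends up submitting exactly one query, and nobody submits their own, so the WSE cannot tell submitter from originator. The honest-peer assumption and the rejection-sampling derangement are mine, for illustration only.

```python
import random

# Toy sketch of group query exchange (assumptions: honest peers; a
# derangement so no member submits their own query to the WSE).
def exchange(queries):
    users = list(queries)
    while True:
        owners = random.sample(users, len(users))   # random permutation
        if all(u != o for u, o in zip(users, owners)):   # derangement found
            return {u: queries[o] for u, o in zip(users, owners)}
```

Calling `exchange({"alice": "q1", "bob": "q2", "carol": "q3"})` yields an assignment where all three queries are still submitted, but never by their owners.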
Article
Full-text available
The increasing use of web search engines (WSEs) for searching healthcare information has resulted in a growing number of users posting personal health information online. A recent survey demonstrates that over 80% of patients use a WSE to seek health information. However, the WSE stores these users' queries to analyze user behavior and for result ranking, personalization, targeted advertisements, and other activities. Health-related queries contain privacy-sensitive information that may infringe users' privacy, since a user's identity may be exposed and misused by third parties. Therefore, privacy-preserving web search techniques such as anonymizing networks, profile obfuscation, and private information retrieval (PIR) protocols are used to ensure the user's privacy. In this paper, we propose the Privacy Exposure Measure (PEM), a technique that enables a user to control his/her privacy exposure while using PIR protocols. PEM assesses the similarity between the user's profile and a query before posting it to the WSE and assists the user in avoiding privacy exposure. The experiments demonstrate a 37.2% difference between user profiles created through a PEM-powered PIR protocol and the usual user profiles. Moreover, PEM offers more privacy to the user even in the case of a machine-learning attack.
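A PEM-style "similarity before posting" check might be sketched as follows. The representation (topic-probability vectors over a shared topic set), the cosine measure, and the 0.5 threshold are all my assumptions for illustration; the paper's actual similarity function is not given in this excerpt.

```python
import math

# Hypothetical PEM-style pre-posting check (assumptions: profiles and
# queries are vectors over the same topic set; cosine similarity).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def exposes_profile(profile_vec, query_vec, threshold=0.5):
    # A query too similar to the stored profile would reinforce it,
    # so the user may route it through the PIR group instead.
    return cosine(profile_vec, query_vec) >= threshold
```

A query orthogonal to the profile passes straight through; a query closely matching the profile triggers the protective path.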
... Distributed Systems - These approaches necessitate the cooperation of a collective of users working together to safeguard their privacy, effectively concealing their actions within the activities of numerous others (Reiter and Rubin, 1998; Castellà-Roca et al., 2009; Lindell and Waisbard, 2010; Viejo and Castellà-Roca, 2010; Erola et al., 2011b; Romero-Tris et al., 2011b; Ullah et al., 2016a, 2016b). Typically, these techniques place users into a large group where they submit requests on behalf of other group members, exchanging their queries. ...
Article
Internet-based services process and store numerous search queries around the globe. The use of web search engines, such as Bing and Google, as well as personal assistants (e.g., Alexa and Cortana) and task-specific systems (e.g., YouTube, Netflix, Amazon), are relevant examples. The queries associated with such services may be stored and sold for profit. Before doing so, personal and sensitive information must be sanitized, as required by current regulations. This can be cumbersome for some organizations. We present an automated solution for anonymizing unstructured data, like that used within query logs. Our solution uses a lightweight probabilistic k-anonymity approach, which allows verifiable real-time privacy protection. It addresses previous limitations and improves performance. We validate the feasibility of the approach under several evaluation metrics, including data utility, privacy, and speed.
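The core k-anonymity property behind such sanitization can be sketched as a simple check. The log representation below (a list of quasi-identifier/query pairs) is my assumption; the paper's probabilistic variant relaxes this strict per-group count.

```python
from collections import Counter

# Minimal k-anonymity check over a query log (assumption: a release is
# k-anonymous if every quasi-identifier value appears at least k times,
# so no record's quasi-identifier singles out fewer than k individuals).
def is_k_anonymous(log, k):
    counts = Counter(qi for qi, _query in log)
    return all(c >= k for c in counts.values())
```

A log in which one quasi-identifier appears only once fails the check for k = 2, which is exactly the situation the AOL release ran into.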
... In the dynamic group category, researchers introduced protocols consisting of entities such as a user, a central server (CS), and a WSE [6], [14], [24]. These protocols first create a group after receiving connection requests from users. ...
Article
Full-text available
The Web Search Engine (WSE) is a software system used to retrieve data from the web. The WSE uses users' search queries to build user profiles and provide personalized results. Users' search queries hold identifiable information that could compromise the privacy of the respective user. This work proposes a multi-group distributed privacy-preserving protocol (MG-OSLo) and investigates the state-of-the-art distributed privacy-preserving protocols for computing web search privacy. MG-OSLo comprises multiple groups, each with a fixed number of users, and measures the impact of the multi-group design on the user's privacy. The primary objective of this work is to assess local privacy and profile privacy, evaluating the impact of group size and group count on a user's privacy. Two grouping approaches are used to group the users in MG-OSLo: a non-overlapping group design and an overlapping group design. The local privacy results reveal that the probability of linking a query to the user depends on the group size and group count: the higher the group size or group count, the lower the likelihood of relating the query to the user. Profile privacy computes the profile obfuscation level using the privacy metric Profile Exposure Level (PEL). Different experiments have been performed to compute the profile privacy of a subset of the AOL query log for two situations: (i) self-query submissions allowed and (ii) self-query submissions not allowed. The privacy achieved by MG-OSLo is compared with the modern privacy-preserving protocols UUP(e), OSLo, and Co-utile. The results show that MG-OSLo provides better results than OSLo, UUP, and Co-utile. Similarly, the multi-group design has a positive impact on local privacy and user profile privacy.
... Second, a PIR algorithm is executed to privately obtain all the POI records in the designated public cell. Recently, Ullah et al. introduced Poshida, a private information retrieval protocol for WSE applications [204]. They propose to obfuscate the user profile collected and maintained by WSE. ...
Article
Personal data are often collected and processed in a decentralized fashion, within different contexts. For instance, with the emergence of distributed applications, several providers usually correlate their records and provide personalized services to their clients. Collected data include geographical and indoor positions of users and their movement patterns, as well as sensor-acquired data that may reveal users' physical conditions, habits, and interests. Consequently, this may lead to undesired consequences such as unsolicited advertisement and even discrimination and stalking. To mitigate privacy threats, several techniques have emerged, referred to as Privacy Enhancing Technologies (PETs). On one hand, the increasing pressure on service providers to protect users' privacy has resulted in PETs being adopted. On the other hand, service providers have built their business models on personalized services, e.g., targeted ads and news. The objective of the paper is then to identify which of the PETs have the potential to satisfy both of these usually divergent (economic and ethical) purposes. This paper presents a taxonomy classifying eight categories of PETs into three groups and, for better clarity, considers three categories of personalized services. After defining and presenting the main features of PETs with illustrative examples, the paper points out which PETs best fit each personalized service category. Then, it discusses some of the interdisciplinary privacy challenges that may slow down the adoption of these techniques, namely technical, social, legal, and economic concerns. Finally, it provides recommendations and highlights several research directions.
... An oblivious transfer algorithm (see page 104) is first used to project the user's location into a zone of the service's map. A PIR algorithm is then used to obtain all the points of interest present in that zone. More recently, in 2016, Ullah et al. [114] proposed Poshida, a PIR protocol for the web search engine application. The idea is to obfuscate the user profile collected and maintained by the search engine. ...
Book
Full-text available
Better understanding the choice of tools and strategies for preserving privacy. On one hand, end users are keen to benefit from targeted recommendations and promotional offers for a nearby service or product, films, books, etc. On the other hand, companies seek to retain their customers by improving their quality of experience, thereby justifying the collection of personal data that they also monetize. This is the current situation pushing companies to offer ever more personalized services based on the massive collection of personal data. This trend is not without risks, however: leaks of personal data, abuse in the exploitation of these data, and drift toward surveillance environments, to name only a few. In its latest research work, the VPIP Chair has examined this problem and established a taxonomy and survey of Privacy Enhancing Technologies (PETs). From these scientific results, entitled "Privacy Enhancing Technologies for solving the Privacy-Personalization Paradox: Taxonomy and Survey", an electronic book has been produced. Intended for readers who may be novice or advanced on these subjects, this book is offered in different reading formats. In a growing context of personalization of digital services, it allows everyone to situate themselves among the various choices of tools and strategies for preserving privacy.
Thus, eight privacy-preserving technologies, which can be attached to three distinct groups, are reviewed and explained simply: • User-oriented techniques, which require the user to manage the protection of their identity themselves by installing specific software, from certification to the controlled disclosure of the attributes describing them; • Server-oriented techniques, which require data-processing tools capable of anonymizing data and performing computations on encrypted data; • Communication-channel-oriented techniques, which highlight the characteristics of the channel, such as data encryption and the use of intermediate servers.
Chapter
The web search engine (WSE) is an indispensable software system used by people worldwide to retrieve data from the web using keywords called queries. The WSE stores search queries to build the user's profile and provide personalized results. User search queries often hold identifiable information that could compromise the user's privacy. Preserving privacy in web searches is a primary concern for users from various backgrounds. Many techniques have been proposed over time to preserve a person's web search privacy. Some techniques preserve an individual's privacy by obfuscating the user's profile, sending fictitious queries along with the original ones. Others hide the user's identity and preserve privacy through unlinkability. A distributed technique, however, preserves privacy by providing both unlinkability and obfuscation. In distributed protocols, a group of users collaborate to forward each other's queries to the WSE, providing unlinkability and obfuscation. This work presents a survey of distributed privacy-preserving protocols; their benefits, limitations, and evaluation parameters are detailed in this work.
Chapter
Privacy quantification methods are used to quantify the knowledge an adversarial search engine has obtained with and without privacy protection mechanisms; thus, these methods calculate privacy exposure. Private web search techniques are based on many methods (e.g., proxy services, query modification, query exchange, and others). This variety of techniques has prompted researchers to evaluate their work differently. This section introduces the metrics used to evaluate user privacy (protection), as well as the metrics used to evaluate the performance of privacy attacks and theoretical evaluation approaches.
Article
Full-text available
We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. In brief, given a ‘sensitive’ search query, the objective of our work is to retrieve the target documents from a search engine without disclosing the actual query. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the private user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search and investigate the effectiveness of the proposed solution with a set of queries with privacy issues on a large web collection. Experiments show great improvements in retrieval effectiveness over a previously reported baseline in the literature. Furthermore, the methods are more versatile, predictably-behaved, applicable to a wider range of information needs, and the privacy they provide is more comprehensible to the end-user. Additionally, we investigate the perceived privacy via a user study, as well as, measure the system’s usefulness taking into account the trade off between retrieval effectiveness and privacy. The practical feasibility of the methods is demonstrated in a field experiment, scrambling queries against a popular web search engine. The findings may have implications for other IR research areas, such as query expansion, query decomposition, and distributed retrieval.
Conference Paper
Full-text available
Web search queries reveal extensive information about users' personal lives to search engines and Internet eavesdroppers. Obfuscating search queries by adding dummy queries is a practical and user-centric protection mechanism to hide users' search intentions and interests. Despite a few such obfuscation methods and tools, there is no generic quantitative methodology for evaluating users' web-search privacy. In this paper, we provide such a methodology. We formalize the adversary's background knowledge and attacks, the users' privacy objectives, and the algorithms to evaluate the effectiveness of query obfuscation mechanisms. We build upon machine-learning algorithms to learn the linkability between user queries. This encompasses the adversary's knowledge about the obfuscation mechanism and the users' web-search behavior. Then, we quantify the privacy of users with respect to linkage attacks. Our generic attack can run against users for whom the adversary does not have any background knowledge, as well as for cases where some prior queries from the target users have already been observed. We quantify privacy at the query level (the link between a user's queries) and the semantic level (the user's topics of interest). We design a generic tool that can be used for evaluating generic obfuscation mechanisms and users with different web search behavior. To illustrate our approach in practice, we analyze and compare the privacy of users under two example obfuscation mechanisms on a set of real web-search logs.
Conference Paper
Full-text available
Forming an accurate mental model of a user is crucial for the qualitative design and evaluation steps of many information-centric applications such as web search, content recommendation, or advertising. This process can often be time-consuming as search and interaction histories become verbose. In this work, we present and analyze the usefulness of concise human-readable user profiles in order to enhance system tuning and evaluation by means of user studies.
Article
In 2010, Lindell and Waisbard proposed a private web search scheme for malicious adversaries. At the end of the scheme, each party obtains one search word and queries the search engine with that word. We remark that a malicious party could query the search engine with a fake word instead of the word obtained. The malicious party can link the true word to its provider if the victim publicly complains about the false search result. To fix this drawback, each party has to broadcast all shares so as to enable every party to recover all search words and query the search engine with all these words. We also remark that, from a user's perspective, there is a very simple method to achieve the same purpose as private shuffle. When a user wants to privately query the search engine with a word, he can pick another n-1 padding words to form a group of n words and permute these words randomly. Finally, he queries the search engine with all these words.
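The simple client-side method described above (pick n-1 padding words, shuffle, submit all n) can be sketched directly. The padding pool below is a placeholder of my own; any corpus of plausible cover queries would serve.

```python
import random

# Sketch of the single-user alternative to private shuffle described
# above (assumption: PADDING_POOL stands in for a real cover corpus).
PADDING_POOL = ["weather today", "football scores", "pasta recipe",
                "train timetable", "news headlines"]

def cover_queries(real_query, n=4, pool=PADDING_POOL):
    batch = random.sample(pool, n - 1) + [real_query]
    random.shuffle(batch)      # the engine sees n equally likely queries
    return batch
```

From the search engine's view, each of the n submitted words is equally likely to be the real interest, mimicking what the group shuffle achieves without any peers.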
Article
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher's anonymity, but it was not much of a shield. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men" to "dog that urinates on everything." And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia." It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her. AOL removed the search data from its site over the weekend and apologized for its release, saying it was an unauthorized move by a team that had hoped it would benefit academic researchers. But the detailed records of searches conducted by Ms. Arnold and 657,000 other Americans, copies of which continue to circulate online, underscore how much people unintentionally reveal about themselves when they use search engines — and how risky it can be for companies like AOL, Google and Yahoo to compile such data. Those risks have long pitted privacy advocates against online marketers and other Internet companies seeking to profit from the Internet's unique ability to track the comings and goings of users, allowing for more focused and therefore more lucrative advertising. But the unintended consequences of all that data being compiled, stored and cross-linked are what Marc Rotenberg, the executive director of the Electronic Privacy Information Center, a privacy rights group in Washington, called "a ticking privacy time bomb." Mr. 
Rotenberg pointed to Google's own joust earlier this year with the Justice Department over a subpoena for some of its search data. The company successfully fended off the agency's demand in court, but several other search companies, including AOL, complied. The Justice Department sought the information to help it defend a challenge to a law that is meant to shield children from sexually explicit material.
Article
Republican politics, has an interest in the Bible and contributes to political and environmental causes. Mrs. Twombly's profile is part of RapLeaf's rich trove of data, garnered from a variety of sources and which both political parties have tapped. A company called RapLeaf is building databases on people by tapping voter-registration files, shopping histories, social-networking activities and real estate records. WSJ's Emily Steel and Julia Angwin join the Digits show to discuss which sites are using RapLeaf, and what web users can do to try to protect their privacy.
Conference Paper
Web search is an integral part of our daily lives. Recently, there has been a trend of personalization in Web search, where different users receive different results for the same search query. The increasing personalization is leading to concerns about Filter Bubble effects, where certain users are simply unable to access information that the search engines' algorithm decides is irrelevant. Despite these concerns, there has been little quantification of the extent of personalization in Web search today, or the user attributes that cause it. In light of this situation, we make three contributions. First, we develop a methodology for measuring personalization in Web search results. While conceptually simple, there are numerous details that our methodology must handle in order to accurately attribute differences in search results to personalization. Second, we apply our methodology to 200 users on Google Web Search; we find that, on average, 11.7% of results show differences due to personalization, but that this varies widely by search query and by result ranking. Third, we investigate the causes of personalization on Google Web Search. Surprisingly, we only find measurable personalization as a result of searching with a logged in account and the IP address of the searching user. Our results are a first step towards understanding the extent and effects of personalization on Web search engines today.
Article
The quality of results to Web search queries is substantially limited because of the cost and short processing times allowed at search engine's data center to retrieve relevant pages, augment ads, and present them to the end-user. We tackle such an issue by proposing a radically different system, where the search engine replies to a query with a large list of relevant URLs. Our client-side platform then proceeds to download the target HTML files only, parse them, understand their content, and present a summary to the user. Different types of summaries can be created by using plug-ins attached to the base platform. Each plug-in provides the required functionality according to the type of summary desired by the user. With this novel mechanism, our system offers increased computational power for post-processing search results and consequently improves and personalizes the user's search experience while maintaining constant workload at the search engine. We present prototype implementations of the proposed search assistant and an associated shopping plug-in capable of detecting whether a Web-page encapsulates a direct commercial offering. We also review measurements related to projected system performance and demonstrate its applicability to different scenarios.
Article
In this paper we introduce a system called Crowds for protecting users' anonymity on the world-wide-web. Crowds, named for the notion of “blending into a crowd,” operates by grouping users into a large and geographically diverse group (crowd) that collectively issues requests on behalf of its members. Web servers are unable to learn the true source of a request because it is equally likely to have originated from any member of the crowd, and even collaborating crowd members cannot distinguish the originator of a request from a member who is merely forwarding the request on behalf of another. We describe the design, implementation, security, performance, and scalability of our system. Our security analysis introduces degrees of anonymity as an important tool for describing and proving anonymity properties.
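The Crowds forwarding rule described above can be sketched in a few lines. The forwarding probability value is my assumption for illustration; the paper's analysis only requires it to be a fixed p_f.

```python
import random

# Toy sketch of Crowds-style routing (assumption: p_f = 0.75 chosen
# arbitrarily here): each member forwards the request to another random
# member with probability p_f, otherwise submits it to the web server.
def route(members, p_f=0.75):
    path = [random.choice(members)]            # initiator picks a random member
    while random.random() < p_f:
        path.append(random.choice(members))    # keep forwarding inside the crowd
    return path                                # last hop submits to the server
```

Because every member on the path is equally likely to be the originator, the server (and even a colluding member) cannot tell who issued the request, which is the "blending into a crowd" property.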