INFORMATION RETRIEVAL (IR) (PRIVATE VS. PUBLIC) VENINGSTON. K Ph.D. Student, Department of CSE, Government College of Technology, Coimbatore. [email protected]
Jun 21, 2015
INFORMATION RETRIEVAL (IR)
(PRIVATE VS. PUBLIC)
VENINGSTON. K
Ph.D. Student, Department of CSE,
Government College of Technology, Coimbatore.
PRESENTATION OUTLINE
Public IR
What is Web IR?
Overview of Web IR Technologies
Web IR Models
Web Search architecture
Semantic Matching
Personalization in Web IR
Challenges in Web based IR
Challenges in Personalizing Web IR
Summary Note
Private IR
What is Private IR?
How Does It Work?
PIR Model
Approaches to PIR
PIR Properties
Summary Note
2
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
WHY INFORMATION RETRIEVAL? 11
/Dece
mb
er/2
01
3
3
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WEB INFORMATION RETRIEVAL
(WEB SEARCH)
Technologies for helping users to accurately,
quickly, and easily find information on the web
11
/Dece
mb
er/2
01
3
4
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
GOAL OF WEB SEARCH
Accurate Efficient Easy to Use
Results are
relevant
Response time
is short
Good user
experience
Results are
comprehensive
Results are
novel
Fast task
completion
11
/Dece
mb
er/2
01
3
5
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WEB USERS HEAVILY RELY ON SEARCH
ENGINES
11
/Dece
mb
er/2
01
3
6
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
HUGE DATA CENTERS 11
/Dece
mb
er/2
01
3
7
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
OVERVIEW OF WEB SEARCH
TECHNOLOGIES
General Web Search, Entity Search, Facet
Search, Question Answering, Multimedia Search
Ranking, Matching, Retrieval Document
Understanding, Query Understanding, Crawling,
Indexing, Result Presentation, Anti-spam
Classification, Clustering, Ranking, Graph
Learning, Tagging, Distributed Computing
11
/Dece
mb
er/2
01
3
8
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WEB SEARCH ARCHITECTURE
Query
StringIR
System
Ranked
Documents
1. Page1
2. Page2
3. Page3
.
.
Document
corpus
Web Spider
9
11
/Dece
mb
er/2
01
3
9
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
COMPONENT TECHNOLOGIES FOR WEB IR
Relevance Ranking
Importance Ranking
Web Page Understanding
Query Understanding
Crawling
Indexing
Search Result Presentation
Anti-Spam
Search Log Data Mining / Web Mining
11
/Dece
mb
er/2
01
3
10
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
THREE IMPORTANT PROCESSES IN WEB IR
Retrieval
Finding documents from inverted index
Matching
Calculating relevance score between query and
document pair
Ranking
Ranking documents based on relevance scores,
importance scores, etc.,
11
/Dece
mb
er/2
01
3
11
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WEB IR MODELS
Vector Space Model (Salton 1975 )
Probabilistic Model
Okapi or BM25 Model (Robertson and Walker
1994 )
Language Model (Ponte and Croft 1998 )
User Model
11
/Dece
mb
er/2
01
3
12
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
VECTOR SPACE MODEL 11
/Dece
mb
er/2
01
3
13
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PROBABILISTIC MODEL 11
/Dece
mb
er/2
01
3
14
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
OKAPI OR BM25 MODEL 11
/Dece
mb
er/2
01
3
15
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
LANGUAGE MODEL 11
/Dece
mb
er/2
01
3
16
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
USER MODEL
User models are personal characteristics of the
user that the system maintains
A user profile can be thought as a user model
Types of user models Depending on the user being modeled
Individual
Canonical (group)
Depending on Acquisition model
Explicit (stated)
Implicit (inferred)
11
/Dece
mb
er/2
01
3
17
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
SEMANTIC MATCHING 11
/Dece
mb
er/2
01
3
18
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PERSONALIZATION - ENVIRONMENTS WHERE
IS BEING USED
Databases
Newsgroups
Personal Information Management (desktop files, E-mail,
bookmarks, etc.)
News: electronic journals
Search engines
Web sites
Business
e-commerce
e-health
e-etc.,
11
/Dece
mb
er/2
01
3
19
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
OBJECTIVES
To enhance the Personalized Web Search and
Retrieval with an intention to satisfy user‟s search
context
To customize the Web Information Retrieval (IR)
for users.
To Provide results specific to individual users.
It is predominantly important because different users
expect different information even for the same query
To predict whether personalization required or not
To develop Computationally intelligent and
efficient algorithm for this personalization task
11
/Dece
mb
er/2
01
3
20
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PERSONALIZATION IN WEB IR [1/2]
Web Personalization is viewed as an application
of data mining and machine learning techniques
to build models of user behavior that can be
applied to the task of predicting user needs and
adapting future interactions with the ultimate
goal of improved user satisfaction.
11
/Dece
mb
er/2
01
3
21
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PERSONALIZATION IN WEB IR [2/2]
Initially Search engines were concerned with
retrieving relevant documents to a query.
Within the information overload on the web,
it is increasingly difficult for search engines
to satisfy the individual user needs.
Personalization has long been recognized as
an avenue to greatly improve search
experience.
Disambiguates the web search by modeling
the user profile by his/her interests and
preferences.
11
/Dece
mb
er/2
01
3
22
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PROBLEM DESCRIPTION
Personalization in Web IR
Customize search results according to each individual user
Research questions in Personalized Web IR
What to use to Personalize?
How to model and represent past search contexts?
How to Personalize?
How to use it to improve search results?
When not to Personalize?
How to decide whether personalization required or not?
How to know Personalization helped?
How to evaluate personalized results?
11
/Dece
mb
er/2
01
3
23
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
GENERAL PROBLEM STATEMENT
When search query is issued, most of the search
engines return the same results irrespective of
the users interest
Lack the existence of semantic structure and
hence it makes difficult for the machine to
understand the information provided by the user
Lack in Identifying intention of the user
Lack in processing Inaccurate / Ambiguous
queries imprecise keyword
11
/Dece
mb
er/2
01
3
24
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
RELATED WORKS
Short term personalization - book mark
Long term personalization - browsing history
Result Diversification - Query reformulation
Collaborative personalization - for group of
users
Search interaction personalization - Clicks
Session based personalization
Location based personalization
Task based personalization
and so on…
11
/Dece
mb
er/2
01
3
25
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
ARCHITECTURE OF PERSONALIZATION BASED
WEB IR
Rankings
Document
corpus
Ranked
Documents
1. Doc1
2. Doc2
3. Doc3
.
.
1. Doc1
2. Doc2
3. Doc3
.
.
Feedback
Query
String
Revise
d
Query
Re-Ranked
Documents
1. Doc2
2. Doc4
3. Doc5
.
.
Query
Reformulation
Personalized
IR
Web
11
/Dece
mb
er/2
01
3
26
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
CHALLENGES FOR WEB IR
Distributed Data: Documents spread over millions of different web servers.
Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
Large Volume: Billions of separate documents.
Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% near duplicate documents.
Quality of Data: No editorial control, false information, poor quality writing, typos, etc.
Heterogeneous Data: Multiple media types (images, video), languages, character sets, etc.
11
/Dece
mb
er/2
01
3
27
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
CHALLENGES FOR PERSONALIZATION IN
WEB IR
From the system centered approach to a
user centered approach to IR
Modeling the user context in personalized
IR
Exploiting the user context to enhance
search quality
The privacy issues
The evaluation issues
11
/Dece
mb
er/2
01
3
28
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
Focused on the
next part of
presentation
POSSIBLE APPROACHES TO INFORMATION
RETRIEVAL
Statistical approaches
◦ Co-occurrence of features between document
and query
◦ Rank documents based on similarity
Semantic approaches
◦ “Understand” the query, find matching
documents
User profile approaches
◦ User profiles store approximations of user
interests
11
/Dece
mb
er/2
01
3
29
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
BENEFITS OF PERSONALIZED SEARCH
Resolving ambiguity
The profile provides a context to the query in order
to reduce ambiguity.
Example: The profile of interests will allow to distinguish what
the user asked about “Jaguar” (“Animal”, “Car”) really wants
Revealing hidden treasures
The profile allows to bring the most relevant
documents, which could be hidden beyond top
results page
Example: Owner of iPhone searches for Google Android. Pages
referring to both would be most interesting
11
/Dece
mb
er/2
01
3
30
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WHERE TO APPLY USER PROFILES?
The user profile can be applied in several ways
To modify the query itself pre-processing
Query Expansion User profile is applied to add
terms to the query
To process results of a query post-processing
To present document snippets
Adaptation of meta-search
11
/Dece
mb
er/2
01
3
31
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
VARIATIONS OF USER PROFILE USAGE1
1/D
ece
mb
er/2
01
3
32
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
SUMMARY ON IR
Web Information Retrieval is a very challenging
yet exciting area!
Solution: Learning individual user to match the
query with the document
Personalized Web Information Retrieval
Promises significant quality improvements. However,
they are far from optimal
Thus, more research is necessary in the field of IR
“Computational Intelligence“ could be adopted by
search tools to manage effectively search,
retrieval, filtering and presenting relevant
information.
11
/Dece
mb
er/2
01
3
33
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PRIVATE INFORMATION RETRIEVAL (PIR)
[1995]
Goal: allow user to query database while hiding the identity of the data-items.
Note: hides identity of data-items; not existence of interaction with the user.
Motivation: patent databases; stock quotes; web access and so on.
Paradox(?): imagine buying in a store without the seller knowing what you buy.
(Encrypting requests is useful against third parties; not against owner of data.)
11
/Dece
mb
er/2
01
3
34
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
WHAT IS PRIVATE INFORMATION
RETRIEVAL?
Real-World Example:
Suppose there is a movie database and we
want to find information on the movie „Indian‟
We do not want anyone to know about our
interest in this movie.
11
/Dece
mb
er/2
01
3
35
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
THE GOAL OF PIR
Suppose there is a movie database and we want
to find information on the movie „Endiran‟
We do not want the database operator to know
about our interest in this movie.
Users' intentions are to be kept secret
11
/Dece
mb
er/2
01
3
36
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
HOW DOES IT WORK?
Very Simple approach
Download the entire database
Improved approach
Suppose there is a database with blocks D1,…, Dr.
A client wants to retrieve block Dα from the database
in such a way that the database operator learns
nothing about α.
Do this without downloading the entire database.
11
/Dece
mb
er/2
01
3
37
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
GOLDBERG‟S SCHEME
We can represent a database of r blocks as an rxs
matrix D and get the αth block (αth row) of D
using simple linear algebra
Dα = eα.D
Where eα =[0 0 … 1… 0] is a vector with all zeros,
except a one for the α coordinate.
There are l servers, each with a copy of the
database.
We secretly share eα in to v1,….,vl and send one to
each server.
Each server computes and sends their response
ri=vi.D
11
/Dece
mb
er/2
01
3
38
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
GOLDBERG‟S SCHEME
The responses r1,….rk are secret shares for Dα. (k
is the number of responses)
What happens if some of the responses are
wrong?
11
/Dece
mb
er/2
01
3
39
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
AOL SEARCH LOG DATA SCANDAL
#4417749: clothes for age 60
60 single men
best retirement city
jarrett arnold
jack t. arnold
jaylene and jarrett arnold
gwinnett county yellow pages
rescue of older dogs
movies for dogs
sinus infection
Thelma Arnold
62-year-old widow
Lilburn, Georgia
11
/Dece
mb
er/2
01
3
40
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
OBSERVATION
The owners of databases know a lot about the users!
This poses a risk to users‟ privacy.
E.g. consider database with stock prices
What can we do?
Trust them that they will protect our secrecy,
or
Use Cryptography
11
/Dece
mb
er/2
01
3
41
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
HOW CAN CRYPTO HELP?
Note: This problem has nothing to do with
secure communication!
user U database D
11
/Dece
mb
er/2
01
3
42
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
CURRENT SETTING
user Udatabase D
A new primitive:
Private Information Retrieval (PIR)
secure link
11
/Dece
mb
er/2
01
3
43
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
MODELING PIR
Server: holds n-bit string x
n should be thought of as very large
User: desires to retrieve xi and
to keep i private
11
/Dece
mb
er/2
01
3
44
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
x=x1,x2 , . . ., xn {0,1}n
SERVER
i {1,…n}
xi
USER
i j
PRIVATE PROTOCOL TO INFORMATION
RETRIEVAL
11
/Dece
mb
er/2
01
3
45
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
There is NO privacy preservation.
Communication Cost: log n
SERVER
USER
x =x1,x2 , . . ., xn
xi
NON-PRIVATE PROTOCOL
i
i {1,…n}
11
/Dece
mb
er/2
01
3
46
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
Server sends entire database x to User.
Information theoretic privacy.
Communication Cost: n
SERVER
xi
USER
x =x1,x2 , . . ., xn
x1,x2 , . . ., xn
TRIVIAL PRIVATE PROTOCOL
Is this optimal?
“The number of bits communicated between U and S has to be smaller
than n.”
11
/Dece
mb
er/2
01
3
47
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PROBLEM
In any 1-server PIR with information
theoretic privacy the communication is at
least n.
11
/Dece
mb
er/2
01
3
48
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
POSSIBLE SOLUTIONS
User is asked for additional random indices.
Drawback: reveals a lot of information
Employ general crypto protocols to compute xi
privately.
Drawback: highly inefficient (polynomial in n).
Anonymity.
Note: Hides identity of user; not the fact that xi is retrieved.
11
/Dece
mb
er/2
01
3
49
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
ANONYMITY - EXAMPLE
Original Data vs. Anonymized Data
11
/Dece
mb
er/2
01
3
50
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
TWO APPROACHES
Information-Theoretic PIR
Replicate database among k servers.
Unconditional privacy against t servers.
Computational PIR
Computational privacy, based on cryptographic assumptions.
11
/Dece
mb
er/2
01
3
51
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
INFORMATION THEORETIC PRIVACY
(PERFECT PRIVACY)
The distribution of the queries the user sends to
any server is independent of the index he/she
wishes to retrieve.
This means that each server cannot gain any
information about user‟s interest regardless of
his computational power.
11
/Dece
mb
er/2
01
3
52
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
COMPUTATIONAL PRIVACY
The distributions of the queries the user sends to
any server are computationally indistinguishable
by varying the index.
This means that each server cannot gain any
information about user‟s interest provided that
he/she is computationally bounded.
11
/Dece
mb
er/2
01
3
53
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
COMMUNICATION COST
Multiple servers, information-theoretic
PIR: 2 servers, comm. n1/2
k servers, comm. n1/k
log n servers, comm. Poly( log(n) )
Single server, computational PIR: Comm. Poly( log(n) )
11
/Dece
mb
er/2
01
3
54
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
K-SERVER PIR
Correctness: User
obtains xi
Privacy: No single
server gets
information about i
U
S1x {0,1}n
S2x {0,1}n
i
x {0,1}nSk
11
/Dece
mb
er/2
01
3
55
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
input:
PIR PROPERTIES
B1 B2 … Bw
input:
index i = 1,…,w
• the user learns Bi
• the database does not learn i
• the total communication is < w
Note: secrecy of the database is not required
correctness
secrecy (of the user)
non-triviality
These properties needs to be defined more formally!
polynomial time randomized interactive algorithms
11
/Dece
mb
er/2
01
3
56
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PIR PROPERTIES
Correctness
In every invocation of the protocol the user retrieves
the bit he is interested in (i.e. xi)
Privacy
In every invocation of the protocol each server does
not gain any information about the index of the bit
retrieved by the user (i.e. i).
11
/Dece
mb
er/2
01
3
57
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
PIR DOESN‟T EXISTS [1/4]
Correctness, Non-triviality and Secrecy CANNOT be satisfied simultaneously.
Def: A transcript T is possible for (i,B) if P(T(i,B) = T) > 0
Take some T’, and look where it is possible:
T’ T’
T’ T’
indices i
data
base
s B
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
58
PIR DOESN‟T EXISTS [2/4]
secrecy → if
T’ is possible for some B and i
then
it is possible for B and all the other i’s
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
indices i
data
base
s B
T’ T’
T’ T’
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
59
PIR DOESN‟T EXISTS [3/4]
non-triviality → length(transcript) < length(database)↓
# transcripts < #databases↓
there has to exist T’ that is possible for two databases B0 and B1
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
data
base
s B
← B0
← B1
indices i
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
60
PIR DOESN‟T EXISTS [4/4]
B0 and B1 differ on at least one index i’. So, if i’ is the input of the user then
correctness → contradiction
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’ T’
data
base
s B
← B0
← B1
i‟
↓
indices i
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
61
THUS, IDEAL PIR DOESN‟T EXIST!
How to bypass the impossibility result?
Two ideas:
limit the computing power of a cheating database
use a larger number of “independent” databases
11
/Dece
mb
er/2
01
3A
ICT
E F
DP
on
Web
Ap
plica
tion
Secu
rity
62
SUMMARY
Complexity of PIR
Communication
Computation
Possible Extensions
Symmetric PIR
User may not learn any item other than the one he/she
requested
Searching by key-words
Public-key encryption with key-word search
11
/Dece
mb
er/2
01
3
63
AIC
TE
FD
P o
n W
eb
Ap
plica
tion
Secu
rity
REFERENCES
Xiaohui Tao, Yuefeng Li, and Ning Zhong, “A Personalized Ontology model for Web information gathering”, IEEE Trans. Knowledge and Data Engg., vol.23, No. 4, pp 496-511, April 2011.
Markus Strohmaier, Mark Kr¨oll“Acquiring Knowledge about human goals from search query logs”, ACM Transactions on Information System, March 2011.
K.W.-T. Leung, W. Ng, and D.L. Lee, “Deriving Concept- Based User Profiles from Search Engine Logs,” IEEE Trans. Knowledge and Data Engg., vol. 22, no. 7, pp 969-982, July. 2010.
Zhicheng Dou, Ruihua Song, Ji-Rong Wen, and Xiaojie Yuan, “Evaluating the Effectiveness of Personalized Web Search” IEEE Trans. Knowledge and Data Engg., Vol. 21, No. 8,pp 1178-1190, Aug 2009.
Y. Li and N. Zhong. “Mining Ontology for Automatically Acquiring Web User Information Needs”, IEEE Transactions on Knowledge and Data Engg., 18(4), pp 554-568, April 2006.
Fang Liu, Clement Yu, Weiyi Meng, “Personalized Web Search for Improving Retrieval Effectiveness” IEEE Trans. Knowledge and Data Engg., Vol. 16, No. 1,pp 28-40, January 2004.
B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private information retrieval”. Journal of the ACM 45(6),pp 965-982, 1995.
THANKING YOU