Dec 24, 2015
IMA 8/11/2006 Advanced Math Search 1
Relevance Ranking and Hit Packaging in Math Search
Abdou Youssef
The George Washington University
And
The National Institute of Standards and Technology
(DLMF)
Outline
  What are relevance ranking and hit packaging
  Why relevance ranking and hit packaging
  Math-relevance ranking: factors & methods
  Math-hit packaging: issues and method
Relevance Ranking: What and Why
  What:
    Measuring the relevance of each hit to a query
    Sorting the hits from the most to the least relevant
  Why:
    Numbers of hits are expected to be in the hundreds and even thousands
    Too taxing, tedious and time consuming for users to plow through the hits looking for the relevant one(s)
Hit-Packaging: What and Why
  What:
    Providing with each hit short, representative excerpts from the corresponding document
  Why:
    Numbers of hits in the hundreds/thousands
    Relevance ranking may not be perfect
    There could be several objectively top-ranking, equally relevant hits
    Brief hit-descriptions help users select
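Relevance Ranking: How it is typically done
For any document d and query q, the relevance score of d is:
\[
\mathrm{score}(d, q) = \sum_{\text{terms } t \text{ in } q} \frac{\mathrm{freq}(t \text{ in } d)}{|d|} \cdot \log\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]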
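As a rough illustration of the text-IR score on this slide, the sketch below sums, over the query terms, the term's frequency in d divided by the length of d, times the log of (number of docs in the DB / number of docs having the term). The function name `tfidf_score`, the token-list representation of documents, and the sample collection are assumptions made for this example, not part of the talk.

```python
import math
from collections import Counter


def tfidf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Sum over query terms of (freq of t in d / |d|) * log(N / docs having t)."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df == 0:
            continue  # the term occurs nowhere in the collection
        score += (counts[t] / len(doc_tokens)) * math.log(num_docs / df)
    return score


# Hypothetical three-document collection, already tokenized.
docs = [
    ["bessel", "function", "zeros"],
    ["gamma", "function"],
    ["bessel", "bessel", "integral", "zeros"],
]
# Number of documents containing each term.
doc_freq = Counter(t for d in docs for t in set(d))
scores = [tfidf_score(["bessel"], d, doc_freq, len(docs)) for d in docs]
```

Here the third document scores highest simply because "bessel" makes up a larger share of its tokens.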
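Relevance Ranking: How it is typically done (Contd.)
Some search systems allow users to boost some terms over others in a query:
\[
\mathrm{score}(d, q) = \sum_{\text{query terms } t} \mathrm{Boost}(t \text{ in } q) \cdot \frac{\mathrm{freq}(t \text{ in } d)}{|d|} \cdot \log\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]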
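A minimal sketch of the boosted variant: each query term's contribution (frequency in d over the length of d, times the log of the inverse document frequency) is multiplied by a per-term boost. Representing the query as a dict from term to boost, and the name `boosted_score`, are assumptions for illustration only.

```python
import math
from collections import Counter


def boosted_score(query_boosts, doc_tokens, doc_freq, num_docs):
    """query_boosts maps each query term to its boost (1.0 means no boost)."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    score = 0.0
    for t, boost in query_boosts.items():
        df = doc_freq.get(t, 0)
        if df == 0:
            continue  # the term occurs nowhere in the collection
        tf = counts[t] / len(doc_tokens)
        score += boost * tf * math.log(num_docs / df)
    return score


# Hypothetical collection statistics: 10 documents, "bessel" occurs in 2 of them.
doc = ["bessel", "function", "zeros", "table"]
plain = boosted_score({"bessel": 1.0}, doc, {"bessel": 2}, 10)
boosted = boosted_score({"bessel": 2.0}, doc, {"bessel": 2}, 10)
```

With a single query term, doubling its boost doubles the score; with several terms, boosts shift the balance between them.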
Why Current Ranking Schemes Not Good for Math Search
Length of a math object (e.g., equation) often has no bearing on its relevance/importance
Frequency of a term in a math object also has no bearing on the math object's relevance
Many considerations that impact the relevance/importance of a math object are not captured by the text-IR relevance metric
Factors to Consider in Math Relevance Ranking
  Static Factors
    Static Weighting
  Dynamic Factors
    Dynamic Weighting
Static Weighting: Determined fully by content/author
  Not all objects in a math file are of equal importance
  Therefore, when ranking hits, the nature of hit-contents must be factored in
  Some objects must be given more weight than others in calculating the relevance score of a hit
Static Weighting: Possible Hierarchies of Weights
  Definitions, Theorems, Propositions, Corollaries, Lemmas
  Special Functions, Operators, Other math identifiers
  Expert-Ranked Formulas, ...
Static Weighting: Native vs. Non-native Entities
  Native entities
    An entity (e.g., term, concept, special function) should carry more weight in its “native chapter” than in passing references in other chapters
  Native connections
    A connection between two entities should carry more weight in the chapter of either entity than in other chapters
Dynamic Weighting: Determined by Query/Users
  Query-biased weighting
  Number and weights of external references to an item
  Number of recent accesses to an item
    By the same user in current session
    By multiple users in the last N days
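One of the dynamic factors listed here, the number of accesses to an item in the last N days, could be computed along the following lines. The access log as a list of timestamps and the name `recent_access_count` are hypothetical; the talk does not say how accesses are recorded.

```python
from datetime import datetime, timedelta


def recent_access_count(access_times, now, days):
    """Count the accesses that fall within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return sum(1 for t in access_times if cutoff <= t <= now)


# Hypothetical access log for one item.
now = datetime(2006, 8, 11)
log = [now - timedelta(days=1), now - timedelta(days=10), now - timedelta(days=40)]
count_last_30_days = recent_access_count(log, now, 30)
```

The same function covers the per-session case if the log is first filtered to one user's accesses and the window is set to the session length.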
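A New Model of Relevance Metrics
For any math document/object d and query q, the relevance score of d is:
\[
\mathrm{score}(d, q) = \sum_{\text{query terms } t} f(\mathrm{weight}(t)) \cdot g(\mathrm{weight}(t, d)) \cdot h(\mathrm{weight}(d)) \cdot \frac{\mathrm{freq}(t \text{ in } d)}{|d|} \cdot \log\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]
The functions f and g are attenuating functions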
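To make the structure of the new metric concrete, the sketch below multiplies each query term's text-IR contribution by f(weight(t)), g(weight(t,d)) and h(weight(d)). The talk does not specify the attenuating functions; `attenuate` (a log(1 + x) damping) is an assumption chosen only for illustration, as are the function names and the dict-based weight lookups.

```python
import math
from collections import Counter


def attenuate(x):
    """Hypothetical attenuating function: grows slowly as x grows."""
    return math.log1p(x)


def math_score(query_terms, doc_tokens, doc_freq, num_docs,
               term_weight, term_doc_weight, doc_weight,
               f=attenuate, g=attenuate, h=attenuate):
    """term_weight: t -> weight(t); term_doc_weight: t -> weight(t, d) for this d;
    doc_weight: weight(d). Missing term weights default to 1.0."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    h_d = h(doc_weight)
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df == 0 or counts[t] == 0:
            continue  # the term contributes nothing to this document
        tfidf = (counts[t] / len(doc_tokens)) * math.log(num_docs / df)
        score += f(term_weight.get(t, 1.0)) * g(term_doc_weight.get(t, 1.0)) * h_d * tfidf
    return score


# Same document and query, scored with two different document weights.
doc = ["bessel", "zeros"]
low = math_score(["bessel"], doc, {"bessel": 1}, 4, {}, {}, doc_weight=1.0)
high = math_score(["bessel"], doc, {"bessel": 1}, 4, {}, {}, doc_weight=3.0)
```

Because the weights enter multiplicatively, a heavier document (for example a definition) outranks an otherwise identical lighter one, while the attenuation keeps very large weights from dominating the text-IR part.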
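Weight(t,d)
Weight(t,d) defines the weight of term t
  Intrinsically, and
  In the context of document d
\[
\mathrm{Weight}(t, d) = \alpha(\mathrm{type}(t)) \, \beta(\mathrm{context}(t, d))
\]
(α and β are placeholder names for the two component functions.)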
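One possible reading of Weight(t,d), sketched under stated assumptions: an intrinsic factor that depends on the type of the term, combined with a context factor that is larger when the document lies in the term's native chapter. The numeric values, the product as the way of combining the two factors, and all names below are hypothetical.

```python
# Illustrative values only; the talk does not give numeric weights.
TYPE_FACTOR = {
    "special_function": 3.0,
    "operator": 2.0,
    "identifier": 1.0,
}
NATIVE_FACTOR = 2.0  # assumed bonus for a term used in its native chapter


def term_doc_weight(term_type, term_home_chapter, doc_chapter):
    """Intrinsic factor from the term's type times a native/non-native context factor."""
    type_factor = TYPE_FACTOR.get(term_type, 1.0)
    context_factor = NATIVE_FACTOR if term_home_chapter == doc_chapter else 1.0
    return type_factor * context_factor


native = term_doc_weight("special_function", "Bessel Functions", "Bessel Functions")
passing = term_doc_weight("special_function", "Bessel Functions", "Airy Functions")
```

A richer context factor could also look at whether the term sits in a definition, a theorem, or running text.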
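Weight(d)
Weight(d) defines the weight of the document/object d depending on
  The nature/type of d
  The number of pointers to d
  The number of accesses to d
\[
\mathrm{Weight}(d) = \gamma(\mathrm{type}(d)) \, \delta(\#\mathrm{Pointers}) \, \epsilon(\#\mathrm{Recent.Accesses})
\]
(γ, δ and ε are placeholder names for the three component functions.)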
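A minimal sketch of Weight(d) under assumed choices: a factor for the nature/type of d, times damped factors for the number of pointers to d and the number of recent accesses to d. The type values, the 1 + log(1 + n) damping, and the product form are illustrative assumptions, not the talk's definitions.

```python
import math

# Illustrative values only; the talk does not give numeric weights.
DOC_TYPE_FACTOR = {
    "definition": 3.0,
    "theorem": 2.5,
    "lemma": 1.5,
    "other": 1.0,
}


def doc_weight(doc_type, num_pointers, num_recent_accesses):
    """Type factor times damped pointer and recent-access factors."""
    type_factor = DOC_TYPE_FACTOR.get(doc_type, 1.0)
    # 1 + log(1 + n) equals 1 when n is 0, so an unreferenced item keeps its type weight.
    pointer_factor = 1.0 + math.log1p(num_pointers)
    access_factor = 1.0 + math.log1p(num_recent_accesses)
    return type_factor * pointer_factor * access_factor


fresh_definition = doc_weight("definition", 0, 0)
popular_definition = doc_weight("definition", 25, 100)
```

Only the type factor is static; the pointer and access counts change over time, so this weight would need periodic recomputation.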
This math-specific relevance scoring scheme is currently being developed and implemented for DLMF
Hit Packaging
  When the hit-content size is small, display the whole content with the hit
    Equation hits: the equation itself
    Graph hits: the whole graph or just the caption
    Table hits: the whole table or just the caption
    Definition/Theorem/Notation hits: the whole, unless it is embedded in a section
Hit Packaging: When the hit-content is too large
  The hit package must be
    Excerpts from the corresponding document
    Short: 2-5-10 lines long
    Relevant to the query
    Representative of the document contents
How to Choose the Excerpts
  Divide document into small fragments
    Titles (of sections, subsections, etc.)
    Equations
    Captions
    Sentences
  Compute the relevance for each fragment
  Rank the fragments by their relevance
  Choose 5-10 top-ranking fragments
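The excerpt-selection steps (fragment, score, rank, choose the top few) can be sketched as follows. The fragmenter here is a crude stand-in that splits on line breaks and sentence-ending punctuation, and the relevance measure (number of distinct query terms a fragment contains) is an assumption; a real system would recognize titles, equations and captions and would use a proper relevance score.

```python
import re


def split_fragments(text):
    """Split on line breaks and after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def fragment_relevance(fragment, query_terms):
    """Number of distinct query terms that occur in the fragment."""
    words = set(re.findall(r"\w+", fragment.lower()))
    return len(words & query_terms)


def choose_excerpts(text, query_terms, k=5):
    """Return up to k top-ranking fragments, shown in document order."""
    query = {t.lower() for t in query_terms}
    fragments = split_fragments(text)
    scored = [(fragment_relevance(f, query), i) for i, f in enumerate(fragments)]
    # Rank by relevance (highest first); ties go to the earlier fragment.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    keep = sorted(i for score, i in top if score > 0)
    return [fragments[i] for i in keep]


text = (
    "Bessel Functions\n"
    "The Bessel function J has infinitely many zeros. "
    "Airy functions are related. "
    "Zeros of Bessel functions are tabulated below."
)
excerpts = choose_excerpts(text, ["bessel", "zeros"], k=2)
```

Returning the chosen fragments in document order, rather than rank order, is a design choice made here so the excerpt reads naturally.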
Implementation Matters
  Query processing and searching must be fast
    Users cannot and should not wait too long
    Servers often have to serve many users at once
  Therefore:
    Hit relevance scoring must be fast
    Hit packaging must be fast
Implementation Matters (Contd.)
  Indexing can be slow, because it is done offline, ahead of search time
  Therefore, compute and store in the index all kinds of information that
    Facilitate the relevance-scoring of hits
    Speed up document-fragmentation and fragment-scoring for query-biased hit-packaging at search time
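As an illustration of moving work to indexing time, the sketch below precomputes per-term postings with term frequencies, plus document lengths, so that search-time scoring only touches the postings of the query terms. The index layout and the names `build_index` and `search` are assumptions; the scoring is the plain text-IR formula, used here only to keep the example short.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Offline step. docs maps doc_id -> list of tokens."""
    postings = defaultdict(dict)  # term -> {doc_id: frequency of term in doc}
    lengths = {}
    for doc_id, tokens in docs.items():
        lengths[doc_id] = len(tokens)
        for term, n in Counter(tokens).items():
            postings[term][doc_id] = n
    return {"postings": dict(postings), "lengths": lengths, "num_docs": len(docs)}


def search(index, query_terms):
    """Search-time step: score only the documents that contain a query term."""
    scores = defaultdict(float)
    for t in query_terms:
        hits = index["postings"].get(t, {})
        if not hits:
            continue
        idf = math.log(index["num_docs"] / len(hits))
        for doc_id, n in hits.items():
            scores[doc_id] += (n / index["lengths"][doc_id]) * idf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = build_index({
    "a": ["bessel", "zeros"],
    "b": ["gamma"],
    "c": ["bessel", "bessel", "bessel", "integral"],
})
ranked = search(index, ["bessel"])
```

Static weights and precomputed fragment boundaries could be stored alongside the postings in the same way, leaving only the query-dependent arithmetic for search time.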