IMA 8/11/2006 Advanced Math Search 1 Relevance Ranking and Hit Packaging in Math Search Abdou Youssef The George Washington University And The National Institute.

IMA 8/11/2006 Advanced Math Search 1

Relevance Ranking and Hit Packaging in Math Search

Abdou Youssef

The George Washington University

And

The National Institute of Standards and Technology

(DLMF)


Outline

What are relevance ranking and hit packaging
Why relevance ranking and hit packaging
Math-relevance ranking: factors & methods
Math-hit packaging: issues and method


Relevance Ranking: What and Why

What:
  Measuring the relevance of each hit to a query
  Sorting the hits from the most to the least relevant

Why:
  Numbers of hits are expected to be in the hundreds and even thousands
  Too taxing, tedious and time consuming for users to plow through the hits looking for the relevant one(s)


Hit-Packaging: What and Why

What:
  Providing with each hit short, representative excerpts from the corresponding document

Why:
  Numbers of hits in the hundreds/thousands
  Relevance ranking may not be perfect
  There could be several objectively top-ranking, equally relevant hits
  Brief hit-descriptions help users select


Relevance Ranking: How it is typically done

For any document d and query q, the relevance score of d is:

\[
\sum_{\text{terms } t \text{ in } q} \frac{\text{freq. of } t \text{ in } d}{|d|} \cdot \log\!\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]
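The text-IR score described on this slide (term frequency normalized by document length, multiplied by the log of the inverse document frequency) can be sketched as follows. This is a minimal illustration, not any particular search engine's implementation: the function name `relevance_score`, the token-list representation of documents, and the three-document toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def relevance_score(query_terms, doc_tokens, corpus):
    """Sum over query terms of freq(t in d)/|d| * log(N / docs having t).

    Hypothetical helper: documents are plain lists of tokens.
    """
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for t in query_terms:
        # Number of documents in the collection that contain term t.
        df = sum(1 for doc in corpus if t in doc)
        if df == 0:
            continue  # a term absent from the whole collection contributes nothing
        score += counts[t] / len(doc_tokens) * math.log(n_docs / df)
    return score

# Toy corpus (invented for illustration only).
corpus = [
    ["bessel", "function", "integral"],
    ["gamma", "function"],
    ["bessel", "zeros", "bessel", "asymptotic"],
]
scores = [relevance_score(["bessel", "zeros"], d, corpus) for d in corpus]
# Sort document indices from the most to the least relevant.
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Here `ranked` lists the third document first, because it contains both query terms and the rarer term "zeros" carries the larger log factor.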


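Relevance Ranking: How it is typically done (Contd.)

Some search systems allow users to boost some terms over others in a query:

\[
\sum_{\text{query terms } t} \frac{\text{freq. of } t \text{ in } d}{|d|} \cdot \text{Boost of } t \text{ in } q \cdot \log\!\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]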
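Query-term boosting as described on this slide can be sketched by multiplying each term's contribution by a user-supplied boost. The sketch below is an assumption-laden illustration: `boosted_score`, the dictionary of boosts, and the precomputed `doc_freq` table are hypothetical names, and the numbers are invented.

```python
import math
from collections import Counter

def boosted_score(query_boosts, doc_tokens, doc_freq, n_docs):
    """query_boosts: term -> boost factor chosen by the user (hypothetical format).
    doc_freq: term -> number of documents in the collection having that term.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    total = 0.0
    for term, boost in query_boosts.items():
        df = doc_freq.get(term, 0)
        if df == 0 or length == 0:
            continue
        # freq(t in d)/|d| * Boost(t in q) * log(N / docs having t)
        total += (counts[term] / length) * boost * math.log(n_docs / df)
    return total

doc = ["bessel", "zeros", "bessel", "asymptotic"]
doc_freq = {"bessel": 2, "zeros": 1}
plain = boosted_score({"bessel": 1.0, "zeros": 1.0}, doc, doc_freq, n_docs=3)
boosted = boosted_score({"bessel": 1.0, "zeros": 3.0}, doc, doc_freq, n_docs=3)
```

With all boosts equal to 1.0 the score reduces to the unboosted form; raising the boost on "zeros" raises the score of every document containing that term.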


Why Current Ranking Schemes Are Not Good for Math Search

Length of a math object (e.g., equation) often has no bearing on its relevance/importance

Frequency of a term in a math object also has no bearing on the relevance of the math object

Many considerations that impact the relevance/importance of a math object are not captured by the text-IR relevance metric


Factors to Consider in Math Relevance Ranking

Static Factors
  Static Weighting
Dynamic Factors
  Dynamic Weighting


Static Weighting: Determined fully by content/author

Not all objects in a math file are of equal importance

Therefore, when ranking hits, the nature of hit-contents must be factored in
  Some objects must be given more weight than others in calculating the relevance score of a hit


Static Weighting: Possible Hierarchies of Weights

Definitions
Theorems
Propositions, Corollaries
Lemmas

Special Functions
Operators
Other math identifiers

Expert-Ranked Formulas ...


Static Weighting: Native vs. Non-native Entities

Native entities
  An entity (e.g., term, concept, special function) should carry more weight in its “native chapter” than in passing references in other chapters

Native connections
  A connection between two entities should carry more weight in the chapter of either entity than in other chapters


Dynamic Weighting: Determined by Query/Users

Query-biased weighting
Number and weights of external references to an item
Number of recent accesses to an item
  By the same user in current session
  By multiple users in the last N days


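A New Model of Relevance Metrics

For any math document/object d and query q, the relevance score of d is:

\[
\sum_{\text{query terms } t} f(\mathrm{weight}(t)) \cdot g(\mathrm{weight}(t,d)) \cdot h(\mathrm{weight}(d)) \cdot \frac{\text{freq. of } t \text{ in } d}{|d|} \cdot \log\!\left(\frac{\text{Num. docs in DB}}{\text{Num. docs having } t}\right)
\]

The functions f and g are attenuating functions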
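A rough sketch of how this model could be evaluated is given below. The slides do not specify the attenuating functions, so `log1p` is used purely as an assumed stand-in for f, g, and h; the weight values, the default weight of 1.0, and the name `math_score` are likewise hypothetical.

```python
import math
from collections import Counter

def attenuate(w):
    # Assumed attenuating function: increases with w, but sublinearly.
    return math.log1p(w)

def math_score(query_terms, doc_tokens, doc_freq, n_docs,
               term_weight, term_doc_weight, doc_weight,
               f=attenuate, g=attenuate, h=attenuate):
    """term_weight: t -> weight(t); term_doc_weight: t -> weight(t, d) for this d;
    doc_weight: weight(d). Missing term weights default to 1.0 (an assumption)."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    if length == 0:
        return 0.0
    total = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df == 0:
            continue
        tfidf = counts[t] / length * math.log(n_docs / df)
        total += f(term_weight.get(t, 1.0)) * g(term_doc_weight.get(t, 1.0)) * tfidf
    # h(weight(d)) does not depend on t, so it is factored out of the sum.
    return h(doc_weight) * total

doc = ["bessel", "zeros", "bessel", "asymptotic"]
doc_freq = {"bessel": 2, "zeros": 1}
args = (["bessel", "zeros"], doc, doc_freq, 3, {"bessel": 3.0}, {"bessel": 2.0})
as_definition = math_score(*args, doc_weight=4.0)  # e.g. the object is a definition
as_remark = math_score(*args, doc_weight=1.0)      # e.g. a passing remark
```

With identical text, the object given the larger document weight scores higher, which is the intended effect of static weighting; the attenuation keeps a large weight from overwhelming the term statistics.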


Weight(t,d)

Weight(t,d) defines the weight of term t
  Intrinsically, and
  In the context of document d

\[
\mathrm{Weight}(t,d) = \Phi\big(\mathrm{type}(t),\ \mathrm{context}(t,d)\big)
\]


Weight(d)

Weight(d) defines the weight of the document/object d depending on
  The nature/type of d
  The number of pointers to d
  The number of accesses to d

\[
\mathrm{Weight}(d) = \Psi\big(\mathrm{type}(d),\ \#\text{Pointers},\ \#\text{Recent Accesses}\big)
\]


This math-specific relevance scoring scheme is currently being developed and implemented for DLMF


Hit Packaging

When the hit-content size is small, display the whole content with the hit
  Equation hits: the equation itself
  Graph hits: the whole graph or just the caption
  Table hits: the whole table or just the caption
  Definition/Theorem/Notation hits: the whole, unless it is embedded in a section


Hit Packaging: When the hit-content is too large

The hit package must be
  Excerpts from the corresponding document
  Short: 2-5-10 lines long
  Relevant to the query
  Representative of the document contents


How to Choose the Excerpts

Divide document into small fragments
  Titles (of sections, subsections, etc.)
  Equations
  Captions
  Sentences
Compute the relevance for each fragment
Rank the fragments by their relevance
Choose 5-10 top-ranking fragments
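The excerpt-selection steps above (fragment, score, rank, choose the top few) can be sketched as follows. This is only an illustration under simplifying assumptions: fragments are obtained by splitting on line breaks and sentence ends, and a fragment's relevance is taken to be the number of distinct query terms it contains, which is a stand-in for whatever fragment-scoring rule a real system would use. The name `choose_excerpts` and the sample text are hypothetical.

```python
import re

def choose_excerpts(document, query_terms, k=5):
    # Step 1: divide the document into small fragments (lines and sentences).
    fragments = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+|\n+", document)
                 if s.strip()]
    query = {t.lower() for t in query_terms}

    # Step 2: compute a relevance value per fragment (assumed rule:
    # how many distinct query terms appear in the fragment).
    def relevance(fragment):
        words = set(re.findall(r"\w+", fragment.lower()))
        return len(words & query)

    scored = [(relevance(frag), i, frag) for i, frag in enumerate(fragments)]
    # Step 3: rank by relevance, breaking ties by position in the document.
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    # Step 4: keep the k top-ranking fragments that match at all,
    # then present them in their original document order.
    top = [s for s in ranked[:k] if s[0] > 0]
    return [frag for _, _, frag in sorted(top, key=lambda s: s[1])]

doc = ("Bessel Functions\n"
       "The Bessel function J has infinitely many zeros. Tables are given later.\n"
       "Asymptotic zeros of Bessel functions follow from Hankel expansions.")
excerpts = choose_excerpts(doc, ["bessel", "zeros"], k=2)
```

Returning the chosen fragments in document order keeps the hit package readable even though selection is driven by relevance.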


Implementation Matters

Query processing and searching must be fast
  Users cannot and should not wait too long
  Servers often have to serve many users at once

Therefore:
  Hit relevance scoring must be fast
  Hit packaging must be fast


Implementation Matters (Contd.)

Indexing can be slow, because it is done offline, ahead of search time

Therefore, compute and store in the index all kinds of information that
  Facilitates the relevance-scoring of hits
  Speeds up document-fragmentation and fragment-scoring for query-biased hit-packaging at search time
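As an illustration of moving work to indexing time, the sketch below precomputes the quantities that the scoring formulas need (document lengths, per-term document frequencies, and per-document term counts) so that search-time code only performs lookups. The index layout and the names `build_index` and `candidate_hits` are assumptions made for this example, not a description of any existing system.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Offline step. docs: doc_id -> list of tokens (hypothetical input format)."""
    index = {
        "n_docs": len(docs),
        "doc_freq": Counter(),         # term -> number of docs having the term
        "length": {},                  # doc_id -> |d|
        "postings": defaultdict(dict)  # term -> {doc_id: freq of term in doc}
    }
    for doc_id, tokens in docs.items():
        index["length"][doc_id] = len(tokens)
        for term, freq in Counter(tokens).items():
            index["doc_freq"][term] += 1
            index["postings"][term][doc_id] = freq
    return index

def candidate_hits(index, query_terms):
    """Search-time step: dictionary lookups only, no scan over documents."""
    hits = set()
    for t in query_terms:
        hits.update(index["postings"].get(t, {}))
    return hits

docs = {
    "a": ["bessel", "function", "integral"],
    "b": ["gamma", "function"],
    "c": ["bessel", "zeros", "bessel", "asymptotic"],
}
index = build_index(docs)
hits = candidate_hits(index, ["zeros", "gamma"])
```

The same idea extends to storing precomputed fragment boundaries per document, so that query-biased hit packaging does not have to re-split documents at search time.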


Closing thoughts

Math-specific relevance scoring and hit packaging are critical to the success of math search

We have barely started to scratch the surface

Much research will be needed
