Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005.


| 2

Why did we ever build federated search?
- No one search service or database had all relevant info, or ever could have
- It was too hard to know what databases to search
- Even if you knew which db’s to search, it was too inconvenient to search them all
- Learning one simple interface was easier than learning many complex ones

| 3

Do we still need federated search?
- No

| 4

No one service or db has all relevant info?
- Databases have grown bigger than ever imagined
  - Google: 8B documents, Google Scholar: 400M+ ?
  - Scirus: 200M
  - Web of Knowledge (Humanities, Social Sci, Science): 28M
  - Scopus: 27M
  - Pubmed: 14M
- Why?
  - Cheaper and larger hard disks
  - Faster hardware, better software
  - World-wide network availability…no need to duplicate

| 5

No one service or db has all relevant info?
- No maximum size in sight
  - A good thing, because content continues to grow
- The simplest technical model for search
  - Databases are logically single and central
  - …but physically multiple and internally distributed
  - Google has ~160,000 servers
- The simplest user model for search
- The catch (but even worse for federated search):
  - Get the data
  - Keep search quality high

| 6

It’s hard to know what services to search?
- Google/Google Scholar plus 1-2 vertical search tools
  - Pubmed, Compendex, WoK, PsycINFO, Scopus, etc.
- For casual searches: Google alone is usually enough
- Specialized smaller db’s where needed
  - Known to researcher or librarian, or available from list
- Ask a life science researcher what they use -- “All I need is Google and Pubmed”

| 7

It’s hard to know what services to search?
- Alerts, RSS, etc. eliminate some searches altogether
- Still…more than one search/source…but must balance inconvenience against costs of federated search:
  - Will still need to do multiple searches…federated not enough
  - Least common denominator search – few advanced features
    » Users are increasingly sophisticated
  - Duplicates
  - Slower response time
  - Broken connectors
  - The feeling that you’re missing stuff…

| 8

One interface is easier to learn than many?
- Yes…studies suggest users like a common interface (if not a common search service)
- BUT
  - Google has demonstrated the benefits of simplicity
  - More products are adopting simple, similar interfaces
  - There is still too much proprietary syntax – though advanced features and innovation justify some of it

| 9

So what are today’s search challenges?
- Getting the data for centralized and large vertical search services
- Keeping search quality high for these large databases
- Answering hard search questions

| 10

Getting the data for centralized services
- Crawl it if it’s free
- …or make or buy it
  - Expensive, but usually worth the cost
  - Should still be cheaper for customers than many services
- …or index multiple, maybe geographically separate databases with a single search engine that supports distributed search

| 11

Distributed (local/remote) search
- Use common metadata scheme (e.g., Dublin Core)
- Search engine provides parallel search, integrated ranking/results
  - Google, Fast and Lucene already work this way even for ‘single’ database
  - The separate databases can be maintained/updated separately
- Results are truly integrated…as if it’s one search engine
  - One query syntax, advanced capabilities, no duplicates, fast
- Still requires common technology platform
- Federated search standards may someday approximate this
  - Standard syntax, results metadata…ranking? Amazon’s A9?
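
A minimal sketch of what "parallel search, integrated ranking" could look like, assuming a toy in-memory setup. The `Shard` class, the `distributed_search` function and the TF-IDF style scoring are illustrative assumptions, not a description of how Google, Fast or Lucene are implemented. The point it tries to show: the shards first pool their document-frequency statistics, so scores from separately maintained databases are comparable and can be merged into one ranked list.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One separately maintained database (hypothetical): doc id -> tokens."""

    def __init__(self, docs):
        self.docs = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    def doc_freqs(self, terms):
        # Number of local documents that contain each query term.
        return Counter(t for toks in self.docs.values() for t in set(toks) if t in terms)

    def search(self, terms, idf, k):
        # Score with the *global* IDF so scores are comparable across shards.
        scored = []
        for doc_id, toks in self.docs.items():
            tf = Counter(toks)
            score = sum(tf[t] * idf[t] for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)


def distributed_search(shards, query, k=3):
    terms = set(query.lower().split())
    with ThreadPoolExecutor() as pool:
        # Pass 1: gather collection-wide statistics from all shards in parallel.
        df = sum(pool.map(lambda s: s.doc_freqs(terms), shards), Counter())
        n_docs = sum(len(s.docs) for s in shards)
        idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1 for t in terms}
        # Pass 2: each shard ranks locally; the partial lists are merged into one.
        partial = pool.map(lambda s: s.search(terms, idf, k), shards)
        return heapq.nlargest(k, (hit for hits in partial for hit in hits))
```

Federated connectors usually have no access to such collection-wide statistics, which fits the deck's later point that sophisticated ranking is a weak spot for federated search.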

| 12

Keeping search quality high in big db’s
- Can interpret keyword, Boolean and pseudo-natural language queries
- Spell checking, thesauri and stemming to improve recall (and sometimes precision)
- Get lots of hits in a big db, but that’s usually OK if there are good ones on top
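
A rough sketch of how stemming and a thesaurus can raise recall, assuming a deliberately crude suffix-stripping `stem` function and a made-up one-entry `THESAURUS`. Both are placeholders for illustration; a production system would use a real stemmer and a curated thesaurus.

```python
def stem(word):
    # Crude suffix stripping for illustration only; not a real stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical thesaurus entries, not taken from any real vocabulary.
THESAURUS = {"cancer": {"tumor", "neoplasm"}}


def expand(query):
    """Map each query term to the set of stems that should count as a match."""
    expanded = {}
    for term in query.lower().split():
        variants = {term} | THESAURUS.get(term, set())
        expanded[term] = {stem(v) for v in variants}
    return expanded


def matches(query, text):
    doc_stems = {stem(w) for w in text.lower().split()}
    # Boolean AND across query terms, OR across each term's variants.
    return all(stems & doc_stems for stems in expand(query).values())
```

With this expansion, the query "cancer treatments" can match a text that only says "treatment of tumors", which a literal keyword match would miss.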

| 13

Keeping search quality high in big db’s
- Current best practice relevancy ranking is pretty good:
  - Term frequency (TF): more hits count more
  - Inverse document frequency (IDF): hits of rarer search terms count more
  - Hits of search terms near each other count more
  - Hits on metadata count more
    » Use anchor text – referring text – as metadata
  - Items with more links/references to them count more
    » Authoritative links/referrers count yet more
  - Many other factors: length, date, etc.
- Sophisticated ranking is a weak point for federated search
- Google’s genius: emphasize popularity to eliminate junk from the first pages (even if you don’t always serve the best)
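
The ranking factors listed above can be combined roughly as follows. This is a sketch under stated assumptions: the document fields (`title`, `body`, `inlinks`), the weights `w_title`, `w_near`, `w_links` and the exact formula are all hypothetical, chosen only to show each factor contributing to one score. Real engines tune such combinations far more carefully.

```python
import math
from collections import Counter


def min_span(positions_by_term):
    """Smallest distance between occurrences of two different query terms."""
    tagged = sorted((p, t) for t, ps in positions_by_term.items() for p in ps)
    gaps = [b[0] - a[0] for a, b in zip(tagged, tagged[1:]) if a[1] != b[1]]
    return min(gaps) if gaps else None


def score(query, doc, corpus, w_title=2.0, w_near=1.0, w_links=0.5):
    terms = query.lower().split()
    body = doc["body"].lower().split()
    title = set(doc["title"].lower().split())
    tf = Counter(body)
    n = len(corpus)
    total = 0.0
    positions = {}
    for t in terms:
        df = sum(1 for d in corpus if t in d["body"].lower().split())
        idf = math.log((n + 1) / (df + 1)) + 1     # rarer terms weigh more
        total += tf[t] * idf                        # more hits count more
        if t in title:
            total += w_title * idf                  # hits on metadata count more
        positions[t] = [i for i, w in enumerate(body) if w == t]
    span = min_span({t: p for t, p in positions.items() if p})
    if span is not None:
        total += w_near / span                      # nearby terms count more
    total += w_links * math.log1p(doc["inlinks"])   # well-linked items count more
    return total
```

Every factor here needs data about the whole collection or the link graph, not just one result list, which is part of why this kind of ranking is hard to reproduce across federated sources.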

| 14

But search challenges remain
- Finding the best (not just good) documents
  - Popularity may not turn up the best, most recent, etc.
- Answering hard questions
  - Hard to match multiple criteria
    » Find an experimental method like this one
  - Hard to get answers to complex questions
    » What precursors were common to World War I and World War II?
- Summarize, uncover relationships, analyze
- Long-term: understand any question…
- None of the above helped by least common denominator federated search

| 15

Finding the best
- Don’t rely too much on popularity
- Even then, relevancy ranking has its limits
  - “I need information on depression”
  - “Ok…here are 2,352 articles and 87 books”
- Need a dialog…“what kind of depression?”…“psychological”…“what about it?”
- Underlying problem: most searches are under-specified

| 16

One solution: clustering documents
- Group results around common themes: same author, web site, journal, subject…
- Blurt out largest/most interesting categories: the inarticulate librarian model
  - Depression: psychology, economics, meteorology, antiques…
  - Psychology: treatment of depression, depression symptoms, seasonal affective…
  - Psychology: Kocsis, J. (10), Berg, R. (8), …
- Themes could come from static metadata or dynamically by analysis of results text
  - Static: fixed, clear categories and assignments
  - Dynamic: doesn’t require metadata/taxonomy
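
A small sketch of the static-metadata variant, assuming each result is a record with hypothetical `id`, `subject` and `author` fields. The `cluster` function simply groups results by one field and reports the largest groups first; dynamic clustering from result text would need a different technique and is not shown.

```python
from collections import defaultdict


def cluster(results, field, top=3):
    """Group result records by one metadata field, largest groups first."""
    groups = defaultdict(list)
    for r in results:
        values = r.get(field, [])
        # A field may hold a single string or a list of values.
        for value in ([values] if isinstance(values, str) else values):
            groups[value].append(r["id"])
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return ranked[:top]


# Made-up sample records for a search on "depression".
results = [
    {"id": 1, "subject": "psychology", "author": ["Kocsis, J."]},
    {"id": 2, "subject": "psychology", "author": ["Kocsis, J.", "Berg, R."]},
    {"id": 3, "subject": "economics", "author": []},
    {"id": 4, "subject": "meteorology", "author": []},
]

by_subject = cluster(results, "subject")
psychology_only = [r for r in results if r["subject"] == "psychology"]
by_author = cluster(psychology_only, "author")
```

Clustering first by subject and then re-clustering the chosen group by author mirrors the drill-down on the slide: depression, then psychology, then individual authors.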

| 17

Clustering benefits
- Disambiguates and refines search results to get to documents of interest quickly
- Can navigate long result lists hierarchically
  - Would never offer thousands of choices to choose from as input…
  - Access to bottom of list…maybe just less common
  - Won’t work with federated search that retrieves limited results from each source
- Discovery – new aspects or sources
- Can narrow results *after* search
  - Start with the broadest area search – don’t narrow by subject or other categories first
  - Easier, plus can’t guess wrong, miss useful, or pick unneeded, categories…results-driven
    » Knee surgery: cartilage replacement, plastics, …

| 21

Answering hard questions
- Main problem is still short searches/under-specification
- One solution: Relevance feedback – marking good and bad results
  - A long-standing and proven search refinement technique
  - More information is better than less
  - Pseudo-relevancy feedback is a research standard
  - Most commercial forms not widely used…
  - …but Pubmed is an exception
- A catch: Must first find a good document to be similar to…may be hard or impossible
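
One classic way to implement relevance feedback is Rocchio's formula, sketched below. The slides do not name a specific method, so treating Rocchio as the technique is an assumption, as are the weights `alpha`, `beta`, `gamma` and the plain term-count vectors.

```python
from collections import Counter


def vectorize(text):
    return Counter(text.lower().split())


def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward marked-good and away from marked-bad results."""
    new_q = Counter({t: alpha * w for t, w in vectorize(query).items()})
    for docs, weight in ((good_docs, beta), (bad_docs, -gamma)):
        for text in docs:
            for term, w in vectorize(text).items():
                # Average each group's contribution over the group size.
                new_q[term] += weight * w / len(docs)
    # Terms pushed to zero or below are dropped from the refined query.
    return {t: w for t, w in new_q.items() if w > 0}
```

Marking "seasonal depression treatment" as good and "economic depression 1929" as bad for the query "depression" adds "seasonal" and "treatment" to the refined query and keeps "economic" out. Pseudo-relevance feedback follows the same idea but treats the top-ranked results as good without asking the user.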

| 22

One solution: descriptive search
- Let the user or situation provide the ideal “document” – a full problem description – as input in the first place
  - Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description
  - Might draw on user or query context
  - Use thesauri, domain knowledge and limited natural language processing to identify must-have’s
  - Uses lots of data and statistics to find best matches
    » Again, a problem for federated search with limited data access
- Should provide the best possible search short of real language understanding
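
A sketch of descriptive search as "use the whole problem description as the query", assuming TF-IDF vectors and cosine similarity as the matching statistic. The function names and the `must_have` filter are hypothetical; the slides do not specify how must-haves are identified, so here the caller simply passes them in as lowercase terms.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    tokenized = [t.lower().split() for t in texts]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(texts)
    return [
        {t: c * (math.log((n + 1) / (df[t] + 1)) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def descriptive_search(description, corpus, must_have=()):
    """Rank corpus texts by similarity to a full problem description."""
    *doc_vecs, query_vec = tfidf_vectors(list(corpus) + [description])
    hits = []
    for text, vec in zip(corpus, doc_vecs):
        if all(term in vec for term in must_have):  # hard filter on must-have terms
            hits.append((cosine(query_vec, vec), text))
    return sorted(hits, reverse=True)
```

The term weights depend on statistics over the whole corpus, which is the kind of data access a federated layer with a handful of results per source does not have.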

| 23

Summarize, discover & analyze
- How do you summarize a corpus?
  - May want to report on what’s present, numbers of occurrences, trends
  - Ex: What diseases are studied the most? Must know all diseases and look one by one
- How do you find a relationship if you don’t know what relationships exist?
  - Ex: Does gene p53 relate to any disease? Must check for each possible relationship
- Ad hoc analysis
  - How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence

| 24

One solution: text mining
- Identify entities (things) in a text corpus
  - Examples: authors, universities… diseases, drugs, side-effects, genes…companies, law suits, plaintiffs, defendants…
  - Use lexicons, patterns, NLP for finding any or all instances of the entity (including new ones)
- Identify relationships:
  - Through co-occurrence
    » Relationship presumed from proximity
    » Example: author-university affiliation
  - Through limited natural language processing
    » Semantic relations – causes, is-part-of, etc.
    » Examples: drug-causes-disease, drug-treats-disease
    » Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
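
The lexicon-plus-co-occurrence approach can be sketched as below. The two mini-lexicons and the sample sentences in the usage lines are made up for illustration, and "same sentence" is just one possible definition of proximity. The NLP-based route (verbs, voice, anaphora) is not attempted here.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-lexicons; real ones would be far larger.
GENES = {"p53", "brca1"}
DISEASES = {"leukemia", "melanoma"}


def cooccurrences(corpus):
    """Count gene-disease pairs that appear together in the same sentence."""
    pairs = Counter()
    for text in corpus:
        for sentence in re.split(r"[.!?]+", text.lower()):
            words = set(re.findall(r"[a-z0-9]+", sentence))
            # Every gene found in the sentence is paired with every disease found.
            for gene, disease in product(words & GENES, words & DISEASES):
                pairs[(gene, disease)] += 1
    return pairs


# Made-up sentences, not findings from any real study.
sample = [
    "Study A mentions p53 alongside leukemia. BRCA1 was not examined.",
    "Study B again discusses p53 and leukemia. Melanoma is covered separately.",
]
pair_counts = cooccurrences(sample)
```

Because the relationship is only presumed from proximity, counts like these suggest candidates for a gene-disease link and point back to the supporting sentences; they do not establish what the relationship is.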

| 25

Gene-disease relationships?

| 26

Relationships to p53

| 27

Author teams in HIV research?

| 28

Indirect links from leukemia to Alzheimer’s via enzymes

| 29

Long-term: answer any question
- Must recognize multiple (any) entities and relationships
- Must recognize all forms of linguistic relationship
- Must have background of common sense information (or enough entities/relations?)
  - Information on donors (to political parties)
- For now, building text miners, domain by domain, is perhaps the best we can do
  - Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, can try to recognize ‘advancements in drug therapy’

| 30

Summary
- Federated search addressed problems of a different time
  - Had a highly fragmented search space, limitations of individual db’s, technical and interface problems and need to just get basic answers
- Today’s search environment is increasingly centralized and robust
- Range of content and demands of users continue to increase
- Adequate search is a given…really good search is a challenge best served by new technologies that don’t fit into a least-common-denominator framework
  - Need to locate best documents (sophisticated ranking, clustering)
  - Need to answer complex questions
  - Need to go beyond search for overviews, relationship discovery
