Search Engine Industry Trends – Impact for Digital Libraries

Dr. John M. Lervik, CEO FAST

7th International Bielefeld Conference 2004


Fast Search & Transfer (FAST)

Since 1997, FAST has grown globally

– Public company (OSE: ’FAST’)
– 200+ employees, 80 in R&D
– Profitable and well capitalized
– Fast growing
  • > 900 customers & partners (Univ. Lib Bielefeld, HBZ, ZIB, Norwegian Nat’l Lib, Elsevier, LexisNexis, etc)
  • #2 growing company in Europe 1998-2002
– Internet business sold to Overture/Yahoo!
– Acquired AltaVista software w/200 customers

“Industrial Strength”
“Magic Quadrant: Most Visionary”
“Excellent Choice”

[Map locations: Oslo, Tromsø, London, Munich, Rome, Boston, New York, Washington DC, Chicago, San Francisco, Rio de Janeiro, Tokyo]


Mission-Critical Business Search

• Search has become mission-critical & strategic:
– Internet portals: Google, MSN, Yahoo!, …
– E-commerce: Amazon, eBay, …
– Corporate web sites: Dell.com, IBM.com, ...
– Yellow Pages: SEAT PG, TPI PA, Findexa, …
– Directory services: Thomas Publishing, Bonnier…
– Mobile: Vodafone live!, …

• Common purpose: Connect buyer with seller


Search Trends

• “The Google effect”
– Users demand simple one-field search
– Users demand relevant results
– Paid search (advertisement) is the main business driver

• Challenge: Search is much more difficult in the academic and corporate world
– Need to provide the relevant (correct) answer
– Web search: Provide a relevant answer

• Solution: 3rd generation search technology
– Improved relevance through content and query analysis
– Tools for navigation, discovery, and visualization


Digital Library Challenges

• Digital libraries face an information management challenge
– Huge and increasing amount of digital data
– Data/content aggregation, data store (repository), information retrieval & discovery, etc

• Increasing volumes and types of digital data
– Media types: Books, magazines, CDs, ...
– Media formats: Text/numbers (incl metadata), audio files, images, video
– Must support various access patterns, copyright, etc

• Need flexible and efficient interfaces between information and users
– Search engine as unified information access layer


Current Role of Search - Point Solutions

[Diagram: "The Corporation" with "Isolated Solutions". Search applications shown: site search, mail search, DMS search, BI search, corporate search, e-commerce search. Data sources shown: intranet documents; eMail (mail system); documents (DMS, CMS); RDBMS; ERP, CRM; legacy data; data warehouse; data marts.]


… to a Horizontal Search Platform…

[Diagram: Enterprise Search Platform. Content sources, labelled STRUCTURED, UNSTRUCTURED and REAL-TIME:]

– RDBMS (JDBC, ODBC, SQLNet, DW, DM)
– Applications (e.g. ERM, CRM, Help Desk)
– Legacy Data (e.g. ISAM, VSAM, IMS)
– Message Queues (e.g. TIBCO, MQ-Series)
– DMS (e.g. M’Soft CMS, Documentum)
– eMail Systems (e.g. Notes, Exchange)
– Files (e.g. Word, Excel, pdf, images, mp3)
– Portals (e.g. WebSphere, WebLogic)
– WWW (HTML, XML, WML, JavaScript)
– Private Webs (e.g. news feeds, Intranets)
– Direct Push

[Search applications on top of the platform: site search, mail search, BI search, DMS search, corporate search, e-commerce search, …]

A common, unified service for intelligent, dynamic information retrieval

• Web services
• GRID computing


Search Engine: How It Works

[Architecture diagram. Content sources: Web content; files, documents; databases; custom applications; multimedia. Connectors: web crawler, file traverser, database connector, content push. Document processing pipeline. Index files. Search. Query & result processing pipeline with filter. Query, results, alert. Front ends: vertical applications, portals, custom front-ends, mobile devices. Tuning, administration.]

Open, modular, scalable architecture


Search Engine: How It Works

• Connect to content sources and get data
– Web pages (e.g. XML, HTML, WML): Crawler
– Files, documents (e.g. Word, Excel, pdf): File traverser
– Database content (e.g. Oracle, DB2): Database connectors
– Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors


Search Engine: How It Works

• Analyze and index content to make it searchable
– Convert and process content through pre-processing pipeline:
  • Lemmatization, entity extraction, taxonomy classification, ontology
  • Custom logic (e.g. adding special tags)
– Write content to index files
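
The indexing step above can be sketched as a chain of small stages followed by a write to an inverted index. This is only an illustration under stated assumptions: the stage names, the toy `LEMMAS` table, and the in-memory index are hypothetical and are not taken from the FAST product.

```python
from collections import defaultdict

# Hypothetical toy lemma table; a real system would use full linguistic dictionaries.
LEMMAS = {"libraries": "library", "engines": "engine", "searching": "search"}


def tokenize(doc):
    """Split the raw text into lowercase tokens."""
    doc["tokens"] = doc["text"].lower().split()
    return doc


def lemmatize(doc):
    """Map each token to its base form when the toy table knows it."""
    doc["tokens"] = [LEMMAS.get(t, t) for t in doc["tokens"]]
    return doc


def add_tags(doc):
    """Custom logic stage: attach a special tag based on the content."""
    doc["tags"] = ["library"] if "library" in doc["tokens"] else []
    return doc


def run_pipeline(doc, stages):
    """Pass one document through every stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


def build_index(docs, stages):
    """Process each document and write its tokens to an inverted index."""
    index = defaultdict(set)
    for doc in docs:
        processed = run_pipeline(dict(doc), stages)
        for token in processed["tokens"]:
            index[token].add(processed["id"])
    return index


DOCS = [
    {"id": 1, "text": "Digital libraries need search engines"},
    {"id": 2, "text": "Searching the web"},
]
INDEX = build_index(DOCS, [tokenize, lemmatize, add_tags])
```

Because lemmatization runs before the index is written, "Searching" and "search" end up under the same index entry, so a query for either form can reach both documents.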


Search Engine: How It Works

• Analyze query
– Use query language or query API
– Convert and process query through query pipeline:
  • Linguistic processing
  • Custom logic (e.g. query term modification/addition)
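
A query pipeline of this kind might look like the sketch below: one linguistic stage and one custom stage that adds terms. The stop-word list, lemma table, and synonym table are hypothetical stand-ins, and the stage functions are assumptions made for illustration.

```python
# Hypothetical toy resources; real deployments would use full dictionaries and thesauri.
STOP_WORDS = {"the", "a", "of", "for"}
LEMMAS = {"libraries": "library", "engines": "engine"}
SYNONYMS = {"library": ["archive"]}


def normalize(terms):
    """Linguistic processing: lowercase, drop stop words, lemmatize."""
    lowered = [t.lower() for t in terms]
    return [LEMMAS.get(t, t) for t in lowered if t not in STOP_WORDS]


def expand(terms):
    """Custom logic: add synonym terms after the originals, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def process_query(raw_query, stages=(normalize, expand)):
    """Run a raw query string through each query-pipeline stage in order."""
    terms = raw_query.split()
    for stage in stages:
        terms = stage(terms)
    return terms
```

For example, `process_query("Search engines for the Libraries")` yields `["search", "engine", "library", "archive"]`: the stop words are dropped, plurals are reduced to their base form, and the custom stage adds one term.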


Search Engine: How It Works

• Match query to content index
– Query- and content-adaptive matching
– Exploit all information and structure in the data


Search Engine: How It Works

• Return results to user
– Convert and process results through result pipeline:
  • Resort, filter for security, analyze for navigation and discovery (dynamic drilldown)
– Pass results on to application (generated or through API)
– Push results to alert engine and then external environment (e.g. mail, queue)
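
Two of the result-pipeline steps named above, the security filter and the re-sort, can be illustrated as follows. The `acl` field, the group names, and the `score` field are assumptions made for this sketch; the slides do not describe how access rights or scores are represented.

```python
def filter_for_security(hits, user_groups):
    """Keep only hits whose access list shares a group with the user."""
    return [h for h in hits if user_groups & h["acl"]]


def resort(hits, field, descending=True):
    """Re-sort hits by the given field."""
    return sorted(hits, key=lambda h: h[field], reverse=descending)


def process_results(hits, user_groups, sort_field="score"):
    """Result pipeline: security filter first, then re-sort."""
    return resort(filter_for_security(hits, user_groups), sort_field)


# Hypothetical hits as they might come back from the matching step.
HITS = [
    {"id": "a", "score": 0.4, "acl": {"staff"}},
    {"id": "b", "score": 0.9, "acl": {"public", "staff"}},
    {"id": "c", "score": 0.7, "acl": {"public"}},
]
```

Filtering before sorting means a user never sees a hit they may not access, and the remaining stages only process the hits that survive the filter.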


Search Engine Features: Relevant, Organized Information

• Linguistic Analysis
– Auto-language detection
– Natural language processing
– Approximate matching (spelling)
– Lemmatization (grammar)
– Entity extraction, anti-phrasing
– Multiple dictionaries, thesauri

• Taxonomy and Classification
– Structured, unstructured data
– Supervised, unsupervised categorization
– Dynamic classification
– Auto-taxonomy generation (terms, Web)
– Taxonomy toolkit
– Ontologies

• Open, Flexible Relevancy Model
– Absolute and relative query boosting
– Relative document boosting
– Custom processing logic (pre-index, query)
– Rule-based matching

• Powerful Query Language
– Exact matches, wildcards, multiple terms
– “more like this” (query by example), “near”
– Text, integer, Boolean expressions (infinite level of parentheses)
– Integer comparisons (>, ≥, =, <, ≤, ≠)
– Fuzzy queries, concept, ...

• Flexible Search and Sort
– Range searching
– Default sort, sort by field
– Static & dynamic teasers, any field
– Full inclusion, exclusion URI control
– Robot aware

• Navigation, Discovery & Visualization
– Structured, unstructured data
– Dynamic drill-down (faceted browsing)
– Results-based binning
– Statistical analysis


Relevance & Information Discovery

• Traditional: Result sets are typically lists of document identifiers

• 3rd generation: Result set depends on the query intentions
– Traditional result set lists
– Dynamic clustering: Supervised and unsupervised
– Live analytics (dynamic drill-down) for navigation and discovery
– Visualization...

2 ways to search:
- “I know what I want, but I don’t know where it is”
- “I’m not sure what I’m looking for but I know how to get there”

Intelligent Organization

The search bar
Live analytics


Traditional Result Set

There are 2 ways to search for anything:
- “I know what I want, but I don’t know where it is”
- “I’m not sure what I’m looking for but I know how to get there”

The search bar

• Languages
– 77 languages auto-detectable, searchable, sortable
– 20 languages include advanced linguistics
– Multiple code sets for each language

• Multiple field sorting

• Linguistics
– Auto-language detection
– Approximate matching (spelling)
– Lemmatization (grammar)
– Phrase detection
– Anti-phrasing, stop words
– Proximity search
– Multiple dictionaries, thesauri
– Full search language (incl. text, integer, boolean)


Relevance: Ranking – The FCASQ Framework

• Completeness
– How well does the query match superior contexts like the title or the URL?
– Example: query=“Mexico”. Is “Mexico” or “University of New Mexico” best?

• Authority
– Is the document considered an authority for this query?
– Examples: Web link cardinality, article references (citations), product revenue, page impressions, ...

• Statistics
– How well do the contents of this document overall match the query?
– Examples: Proximity, context weights, tf-idf, degree of linguistic normalization, etc

• Quality
– What is the quality of the document?
– Examples: Homepage?, Entry point to product group?, Press release?, ...

• Freshness
– How fresh is the document compared to the time of the query?
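
One simple way to picture how the five FCASQ signals could feed a single rank value is a weighted sum. The slides do not say how the signals are actually combined, so the linear combination, the weights, and the signal values below are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Per-document signals, each assumed to be normalized to the range 0..1."""
    freshness: float
    completeness: float
    authority: float
    statistics: float
    quality: float


# Hypothetical weights; the slides do not give any values.
WEIGHTS = {
    "freshness": 0.1,
    "completeness": 0.3,
    "authority": 0.2,
    "statistics": 0.3,
    "quality": 0.1,
}


def fcasq_score(signals, weights=WEIGHTS):
    """Combine the five signals as a weighted sum (an assumed combination rule)."""
    return sum(weight * getattr(signals, name) for name, weight in weights.items())


def rank(docs, weights=WEIGHTS):
    """Return document ids ordered from highest to lowest combined score."""
    return sorted(docs, key=lambda doc_id: fcasq_score(docs[doc_id], weights), reverse=True)


# Query "Mexico": a page titled "Mexico" matches the title context more
# completely than one titled "University of New Mexico".
DOCS = {
    "Mexico": Signals(0.5, 1.0, 0.6, 0.7, 0.5),
    "University of New Mexico": Signals(0.5, 0.25, 0.6, 0.7, 0.5),
}
```

With every other signal equal, the higher completeness value puts the "Mexico" page first in `rank(DOCS)`, which mirrors the example on the slide.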


Navigation & Discovery

There are 2 ways to search for anything:
- “I know what I want, but I don’t know where it is”
- “I’m not sure what I’m looking for but I know how to get there”

Live Analytics

• Multi-Dimensional Navigation
– Taxonomic, ontological
– Clustering of extracted entities
– Field-based categories

• Dynamic, Automatic Generation
– Auto-generated from configuration definitions
– Re-generated on each query
– Internal scoring for further refinement
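
Field-based navigators that are re-generated on each query can be sketched as value counts over the current hit list. The records, the field names, and the naive substring `search` function are hypothetical; they only stand in for a real engine and its index.

```python
from collections import Counter

# Hypothetical records; field names are illustrative only.
RECORDS = [
    {"title": "Stress echocardiography", "year": 2001, "journal": "Circulation"},
    {"title": "Echocardiographic contrast", "year": 2001, "journal": "Heart"},
    {"title": "Myocardial infarction", "year": 2003, "journal": "Circulation"},
]


def search(records, term):
    """Naive substring match on the title, standing in for the real engine."""
    return [r for r in records if term.lower() in r["title"].lower()]


def drill_down(hits, fields):
    """Count the values of each configured field across the current hits."""
    return {field: Counter(hit[field] for hit in hits) for field in fields}


def refine(hits, field, value):
    """Narrow the hit list to one navigator value chosen by the user."""
    return [hit for hit in hits if hit[field] == value]
```

Calling `drill_down(search(RECORDS, "echocardiograph"), ["year", "journal"])` produces the navigator counts for that query only, and `refine` applies the value a user picks, after which the counts would be computed again on the narrower hit list.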


Automatically Extracted Entities


Information Discovery – Example: Scirus Metadata


Information Discovery – Example: Medical Information (Medline) – 12M Documents

Discovery

• MESH keywords
• Publication year
• Journal Title
• Author(s)
• Chemical substances
• Etc


Information Discovery – Example: Medical Information – 12M Documents


Analytical Search – Example: Author Analysis

Data source: 12M Medline Publications


Example: Echocardiography - Author drill-down

Jim Seward, Mayo

Jim Seward: Publishing pattern

Co-Authors
– A Tajik 56
– J Oh 25
– P Pellikka 16
– B Khanderia 16
– D Hagler 13
– V Roger 13
– K Bailey 13
– F Miller 11

Research Topics
– stress echocardiography
– Image orientation
– regurgitant orifice
– abnormal relaxation
– two-dimensional echocardiography
– ventricular response in patients
– initial repair
– mitral lesions
– echocardiographic contrast
– myocardial infarction


Example 1: Scirus (www.scirus.com)

Scirus is the leading online search engine for scientific content

Twice winner of SEW Best Specialty Search Engine award

• Scientific Web Pages: 140 million Web pages (.edu, .gov, .org, .com, …)

• Proprietary Databases: 30M article records (Medline, ScienceDirect, …)

• Value Added Functionalities
– Large-scale content aggregation
– Automatic content & page classification
– Query refinements (1-D drill-down)


Example 2: Norwegian National Library (www.nb.no)

• One integrated search engine across many diverse projects
– One search interface for all catalogs, instead of search in 100+ databases
– Information from objects of all types of media (multimedia, textual content, metadata)
– Used in in-house library production systems, end-user services, and ongoing innovation projects

• Projects
– The Digital Radio Archive (DRA): NRK Radio historical radio archive (300,000 programs)
– Culture Net Norway: The official gateway to Norwegian culture on the web
– The Digital Newspaper Library: 300,00 pages from year 1763 and onwards
– Cultural Heritage Ekofisk: Content related to the Ekofisk oil field (incl. OAI metadata harvester)
– The National Library’s public web site
– Paradigma (Preservation, Arrangement & Retrieval of Assorted DIGital MAterials)
– The Nordic Web Archive (NWA): Harvesting and archiving of web documents


Summary

• Search engines can do more than just search…
– Unified information access solution for digital libraries
– Open, scalable and modular architecture: Allows for customization
– Adapts to content and queries
– Powerful data discovery, navigation, and visualization

• Many exciting technology developments to come
– More advanced content and query analysis
– Adaptive, personalized query- & content-sensitive matching
– Dynamic result set presentation, navigation, discovery, visualization
– Federation across external content applications


Thank you!