Adding structure to unstructured content for enhanced findability

Do not reinvent Findability and Knowledge Management

Håkan Tylén Western Europe Business Development +46703091665 [email protected]

Agenda outline

Customer/Employee Service, in the Self-service channel

How can I help YOU?

Metadata basics: What is it? Where is it stored?

Metadata is the set of properties that characterize a document.

Inconsistent, incorrect or missing metadata is commonplace within most organizations today

This impairs findability in the context of enterprise search

Hard to scan or navigate results

Documents returned may be incomplete or not current

No confidence in authority and correctness of information

Difficult to locate relevant experts

Poor metadata impairs the search experience: Degraded findability leads to the erosion of users’ trust in search

I’m not confident I will find what I need here…

This is a waste of time!

Unchanged template metadata makes results look like duplicates

Meaningless metadata confuses users as they scan the search results

Missing metadata raises questions about result set completeness

Few options to navigate or refine a large result list other than trying to reformulate the query

Even with refinement tools, users do not rely on them

Multiple variations or spellings

Hit counts do not add up

6 | SharePoint Server 2010 for Internet Sites Microsoft confidential.

ROI - Scenarios

1. Time Wasted Searching

2. Cost of Reworking Information

3. Opportunity Costs to the Enterprise


Scenario 1: Time wasted

€3.000/month salary + social costs ≈ €50.000/year per employee

10 minutes/day × 220 working days ≈ €1.000/employee/year

1000 employees ≈ €1.000.000/year of “released time”
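The arithmetic behind Scenario 1 can be checked with a short sketch. The annual cost, minutes lost, working days and headcount come from the figures above; the 8-hour working day is an assumption that the slide does not state.

```python
# Inputs taken from the slide
ANNUAL_COST_EUR = 50_000        # yearly cost per employee, social costs included
MINUTES_LOST_PER_DAY = 10       # time wasted searching
WORKING_DAYS_PER_YEAR = 220
EMPLOYEES = 1_000

# Assumption (not on the slide): an 8-hour working day
HOURS_PER_DAY = 8


def wasted_cost_per_employee(annual_cost, minutes_per_day, working_days, hours_per_day):
    """Yearly cost of the time one employee loses to searching."""
    hourly_cost = annual_cost / (working_days * hours_per_day)
    hours_lost = minutes_per_day * working_days / 60
    return hourly_cost * hours_lost


per_employee = wasted_cost_per_employee(
    ANNUAL_COST_EUR, MINUTES_LOST_PER_DAY, WORKING_DAYS_PER_YEAR, HOURS_PER_DAY
)
total = per_employee * EMPLOYEES

print(f"per employee: EUR {per_employee:,.0f}/year")
print(f"{EMPLOYEES} employees: EUR {total:,.0f}/year")
```

Under these assumptions the result lands slightly above the rounded €1.000 per employee quoted on the slide.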

Creating quality metadata is a real challenge: Few organizations have good quality metadata on internal content

Challenge

• Ineffective information governance across the enterprise

• Multiple content silos and search interfaces

• Manually entered metadata is inconsistent, incorrect or missing

• No automated tools for content classification

• Impossible to keep up with ever growing content volumes

Assist users in tagging content with automated metadata suggestions or enrichment tools

Solution

• FAST Search for SharePoint (FS4SP) delivers business value out-of-the-box

• Sophisticated content processing optimizes findability across multiple silos of unstructured and structured content

• In addition, property extraction overcomes poor metadata by generating it and normalizing it on-the-fly


The pipeline is a sequentially arranged set of discrete processing stages that break down and enrich content for indexing

Convert documents to plain text (support for 400+ file formats)

Detect document languages and encoding (support for 80+ languages)

Apply linguistic normalization to optimize content for search

Identify and leverage existing metadata where applicable

Parse content to extract or generate additional metadata

Map content and associated metadata (crawled properties) to the index schema (managed properties) for searching

Custom stages can be created and added to the pipeline
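The idea of sequentially arranged stages can be illustrated with a minimal sketch. This is not the FS4SP pipeline API: the stage functions, the document dictionary layout and the property names are all hypothetical, chosen only to show how each stage enriches the document before the next one runs and how a custom stage slots in.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def convert_format(doc: Document) -> Document:
    """Stand-in for format conversion: decode raw bytes to plain text."""
    doc["text"] = doc["raw"].decode("utf-8", errors="replace")
    return doc


def tokenize(doc: Document) -> Document:
    """Very rough tokenization: lowercase word characters only."""
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc


def count_words(doc: Document) -> Document:
    """Example of a custom stage that generates an extra crawled property."""
    doc["crawled"] = {"wordcount": len(doc["tokens"])}
    return doc


def map_properties(doc: Document) -> Document:
    """Map crawled properties to (hypothetical) managed property names."""
    mapping = {"wordcount": "WordCount"}
    doc["managed"] = {
        mapping[name]: value
        for name, value in doc["crawled"].items()
        if name in mapping
    }
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order; every stage sees the output of the previous one."""
    for stage in stages:
        doc = stage(doc)
    return doc


PIPELINE: List[Stage] = [convert_format, tokenize, count_words, map_properties]
result = run_pipeline({"raw": b"Sales forecast for Contoso"}, PIPELINE)
print(result["managed"])
```

Because the pipeline is just an ordered list, adding a custom stage amounts to inserting another function at the right position.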

Content Processing Pipeline – what is it? Enhance your content for an optimal search experience and findability

Properties Mapper: Maps the relevant pieces of content and metadata discovered in the pipeline to the index schema for search

Custom Processing Stage: Enables you to extend the content processing pipeline with custom stages (home-grown solutions or 3rd party software) to address your own business needs

Date and Time Normalization: Converts dates and times to a standard representation, to handle locale-specific representations; for example, the date 14-Mar-10 is equivalent to March 14, 2010

Vectorization: Creates document vectors (phrase/weight pairs reflecting important terms and frequency of occurrence) to enable “find similar” functionality

Property Extraction: Recognizes predefined entities mentioned in the content; out of the box support for Companies, Locations and People but this can be extended to other categories

Lemmatization: Applies language-specific normalization to content so users’ queries match words and phrases in canonical or inflected forms (singular/plural, masculine/feminine, etc.)

Tokenization: Breaks text into tokens using language-specific rules for punctuation, diacritics, accents, compound words, phrases and numbers (currency, telephones, part numbers, etc.)

Language and Encoding Detection: Identifies the encoding and languages used in the text content so that the appropriate linguistic normalization rules and dictionaries can be applied downstream

Format Conversion: Extracts plain text and metadata from multiple content formats (e.g. Microsoft Office, PDF, HTML, etc.)
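The Date and Time Normalization stage can be pictured with a small sketch that converts the two representations from the example (14-Mar-10 and March 14, 2010) to one standard form. The list of accepted formats and the choice of ISO 8601 as the target are assumptions made for illustration; they do not describe how FS4SP implements the stage.

```python
from datetime import datetime

# Illustrative input patterns only; a real stage would cover many more locales.
# Month names are parsed according to the current locale (English by default).
KNOWN_FORMATS = ["%d-%b-%y", "%B %d, %Y", "%Y-%m-%d"]


def normalize_date(text: str) -> str:
    """Return the date in ISO 8601 form (YYYY-MM-DD), trying each known format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date: {text!r}")


short_form = normalize_date("14-Mar-10")
long_form = normalize_date("March 14, 2010")
print(short_form, long_form)
```

Once both spellings map to the same value, the date can be sorted, filtered and used as a refiner regardless of how the author originally wrote it.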

| Type | Doc ID | Title | Author | Date | Size | Keywords | Companies | Locations | People | ... | Body Text |

| | xxx | Sales For… | John Doe | 2010-04-15 | 386 KB | sales; pipe… | Microsoft; … | London; … | Bill Gates; … | … | The mark… |

| | yyy | … | … | … | … | … | … | … | … | … | … |

| | zzz | … | … | … | … | … | … | … | … | … | … |

Property Extraction: Create metadata on-the-fly, adding structure to unstructured content

Locations

London

San Francisco

Moscow

…

People

Bill Gates

Barack Obama

José Caires

...

In a nutshell, property extraction is the ability to

Process unstructured content (e.g. a document’s body)

Recognize entities mentioned in the text (e.g. people, companies, locations, concepts, etc.)

Optionally, normalize variations to a single, canonical form

Expose these extracted entities as crawled properties in the pipeline

Map them to managed properties for filtering and searching
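The steps above can be sketched in a few lines. The dictionaries, the crawled property names and the mapping to managed properties below are hypothetical and kept deliberately tiny; the optional normalization step is left out, and the matching is a plain whole-word lookup rather than the technique FS4SP uses internally.

```python
import re
from typing import Dict, List

# Hypothetical dictionaries: crawled property name -> known entities
DICTIONARIES: Dict[str, List[str]] = {
    "companies": ["Microsoft", "Contoso", "Woodgrove"],
    "locations": ["London", "San Francisco", "Moscow"],
    "people": ["Bill Gates", "Barack Obama"],
}

# Hypothetical mapping from crawled properties to managed properties
MANAGED_PROPERTY_MAP = {
    "companies": "Companies",
    "locations": "Locations",
    "people": "People",
}


def extract_properties(body: str) -> Dict[str, List[str]]:
    """Recognize dictionary entities in the text and expose them as crawled properties."""
    crawled: Dict[str, List[str]] = {}
    for prop, entities in DICTIONARIES.items():
        found = [
            entity
            for entity in entities
            if re.search(r"\b" + re.escape(entity) + r"\b", body, re.IGNORECASE)
        ]
        if found:
            crawled[prop] = found
    return crawled


def map_to_managed(crawled: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Rename crawled properties to the managed properties of the index schema."""
    return {MANAGED_PROPERTY_MAP[name]: values for name, values in crawled.items()}


body = "Bill Gates visited the Contoso office in London."
managed = map_to_managed(extract_properties(body))
print(managed)
```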

Crawled Properties

Managed Properties Index Schema:

Companies

Microsoft

Contoso

Woodgrove

…

Metadata quality is critical to the search experience

FS4SP leverages metadata, i.e. managed properties, to present deep refiners

Offer at-a-glance overview

Organize free-text search results into multiple facets

Make search conversational

Guide users toward possible refinement choices

Prevent users drilling down into a “0 results” dead end

Additional uses for managed properties in FS4SP

Relevancy tuning & ranking

Multi-level sorting

Advanced (or fielded) search

Good metadata greatly improves findability: Property extraction enables consistent metadata across all content

This is really great! Now I can navigate through this large information universe without feeling lost…

Precise hit counts in deep refiners are computed across the whole result set.

And many more…

Concepts

Products

Companies

File Formats

Metadata is also used for relevancy tuning, multi-level sorting and advanced search
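What it means for refiner hit counts to be computed across the whole result set can be shown with a small sketch. The result documents and the `Companies` property are made up for the example, and counting over only the first results is included purely as a contrast, not as a description of any specific product behaviour.

```python
from collections import Counter
from typing import Dict, List


def refiner_counts(results: List[Dict], prop: str) -> Counter:
    """Count how many documents carry each value of a managed property."""
    counts: Counter = Counter()
    for doc in results:
        # A document counts once per distinct value, even if the value repeats
        counts.update(set(doc.get(prop, [])))
    return counts


# Hypothetical result set for one query
results = [
    {"title": "A", "Companies": ["Microsoft", "Contoso"]},
    {"title": "B", "Companies": ["Microsoft"]},
    {"title": "C", "Companies": ["Woodgrove"]},
    {"title": "D", "Companies": []},
]

deep = refiner_counts(results, "Companies")          # whole result set
shallow = refiner_counts(results[:2], "Companies")   # first two hits only

print("deep:", dict(deep))
print("shallow:", dict(shallow))
```

In the shallow count, Woodgrove never appears as a refinement choice even though a matching document exists, which is the kind of gap that makes hit counts fail to add up.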


The Microsoft IT Intranet Environment (as of September 2010)

Regions: Americas; Europe, Middle East and Africa; Asia Pacific

Locations: Seattle, Dublin, Singapore

29.89 TB (31,346,042 MB), growing by 1.5 TB per quarter; 223,595 sites; 545,387 sub-sites

19.4 TB; 127,986 sites; 345,935 sub-sites

Shares shown: 65%, 22%, 13%

Knowledge Transfer: MSW

http://msw/searchcenter/pages/default.aspx

FS4SP automatically detects 80+ languages in content

Property extraction dictionaries are included for 11 languages* and 3 types of entities

Locations

Companies

Persons

The metadata is exposed to users as refiners, drives relevancy and other features to improve findability

This delivers real business value to organizations struggling with issues such as

Poor document metadata

Large content volumes

Lack of result refinement options

Low user adoption of search

Property extraction and refiners in FS4SP: What’s available out-of-the-box?

* Arabic, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish

http://infopedia/fastsearchcenter/pages/default.aspx

Property extraction in FS4SP is customizable using a dictionary, i.e. a list of keywords and phrases

Matching variations can be normalized to a single entry

Several dictionaries may co-exist to address needs of the business

Projects

Products

Customers

Competitors

Employees

Business-specific concepts

The necessary data may be readily available within the organization or from external sources
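How matching variations collapse to a single dictionary entry can be sketched as follows. The dictionary content reuses product names that appear in this deck; the data structure, the function names and the regular-expression matching are assumptions for illustration and do not reflect the FS4SP dictionary format.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical product dictionary: variation -> canonical entry
PRODUCT_VARIATIONS: Dict[str, str] = {
    "FS4SP": "FAST Search Server 2010 for SharePoint",
    "FAST Search for SharePoint": "FAST Search Server 2010 for SharePoint",
    "FAST Search Server 2010 for SharePoint": "FAST Search Server 2010 for SharePoint",
}


def build_matcher(variations: Dict[str, str]) -> Pattern[str]:
    """Compile one pattern for all variations, longest first so long phrases win."""
    ordered = sorted(variations, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract_canonical(text: str, variations: Dict[str, str]) -> List[str]:
    """Return the distinct canonical entries mentioned in the text."""
    lookup = {v.lower(): canonical for v, canonical in variations.items()}
    matcher = build_matcher(variations)
    return sorted({lookup[m.group(0).lower()] for m in matcher.finditer(text)})


text = "We piloted FS4SP; fast search for sharepoint is now live."
products = extract_canonical(text, PRODUCT_VARIATIONS)
print(products)
```

Both spellings in the sample text resolve to the same entry, so a Products refiner would show one value instead of two competing variations.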

Extending property extraction in FS4SP (1/2): Make search speak the language of your business using dictionaries

SharePoint lists & Term Store

LOB applications, Databases & XML

Create custom search refiners to fit your own business needs

External text mining/classification tool

Another approach is to invoke external tools during content processing in FS4SP

This leverages the standard pipeline extensibility mechanism

Such tools typically address problems like

Text mining for entity, fact or relationship extraction

Taxonomy classification

Moreover, these tools may be already deployed for other purposes in the enterprise

Home-grown solutions

3rd party, specialized vendors

Industry sectors or verticals

Scientific or technology domains

Extending property extraction in FS4SP (2/2): Use existing text mining or classification tools to go even further


Original document from repository

Analyze text content

Return metadata tags

Index Content pipeline

Enriched document for indexing

Web service Local software
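The flow in the diagram (original document in, text analyzed by an external tool, metadata tags returned, enriched document passed on for indexing) can be sketched as a stage that delegates to a pluggable analyzer. Everything here is hypothetical: the `keyword_classifier` stands in for a web service or local software, and the document layout is invented for the example rather than taken from the FS4SP extensibility interface.

```python
from typing import Callable, Dict, List

Tags = Dict[str, List[str]]
Analyzer = Callable[[str], Tags]


def keyword_classifier(text: str) -> Tags:
    """Hypothetical local tool; a real deployment might call a web service instead."""
    taxonomy = {
        "Finance": ["invoice", "budget"],
        "Legal": ["contract", "clause"],
    }
    lowered = text.lower()
    hits = [
        category
        for category, words in taxonomy.items()
        if any(word in lowered for word in words)
    ]
    return {"category": hits}


def enrichment_stage(doc: dict, analyze: Analyzer) -> dict:
    """Send the text to the external tool and merge the returned tags into the metadata."""
    tags = analyze(doc["text"])
    enriched = dict(doc)  # shallow copy; the original document is left untouched
    enriched["metadata"] = {**doc.get("metadata", {}), **tags}
    return enriched


doc = {
    "text": "Signed contract and budget attached.",
    "metadata": {"author": "John Doe"},
}
enriched = enrichment_stage(doc, keyword_classifier)
print(enriched["metadata"])
```

Passing the analyzer in as a parameter keeps the stage indifferent to whether the tool is a home-grown solution or a specialized 3rd party product.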


Best practice #1: Deepen your understanding of your audiences and your content

Enterprise content

Before you start deploying enterprise search: understand your content, your users and what they need to get their jobs done effectively.

Marketing Sales Procurement Consulting Research HR / Legal IT Support Production

Best practice #2: Use existing language resources inside and outside your enterprise

Internal assets

• Thesauri, controlled vocabularies

• Taxonomies, ontologies

• Master databases

• Enterprise systems

• Line-of-business applications

• Subject matter experts

• Examples*

  • SharePoint (Lists, Term Store)

  • Employees (AD, HR)

  • Customers (CRM)

  • Suppliers (ERP)

  • Products (PLM)

  • Processes (BPM)

  • Projects (EPM)

Internet resources

• Government agencies

• Industry bodies

• Research institutions

• Academia

• Virtual communities

• Examples

  • Wikipedia.org

  • DBpedia.org

  • WordNet, from Princeton University

  • Medical Subject Headings (MeSH)

Content providers

Specialized vendors

* AD – Active Directory; CRM – Customer Relationship Mgmt.; ERP – Enterprise Resource Planning; PLM – Product Lifecycle Mgmt.; BPM – Business Process Modeling; EPM – Enterprise Project Mgmt.

The language of the business will change over time

External environment

Enterprise content

Users’ needs

Ensure that property extraction dictionaries and search index are systematically updated to respond to these changes

Where possible, automate dictionary upkeep as part of standard business workflows

Taxonomies and thesauri

Enterprise project management

Product lifecycle management

Schedule regular analysis and review checkpoints to handle exceptional cases
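One way to automate dictionary upkeep is to compare the deployed dictionary with its source system on a schedule and act only when they differ. The sketch below assumes both sides can be exported as plain lists of terms; the product names and the idea of flagging a re-processing run are illustrative assumptions, not an FS4SP feature description.

```python
from typing import Dict, Iterable, List


def dictionary_changes(deployed: Iterable[str], source: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the deployed dictionary with its source of truth."""
    deployed_set, source_set = set(deployed), set(source)
    return {
        "added": sorted(source_set - deployed_set),
        "removed": sorted(deployed_set - source_set),
    }


def needs_reprocessing(changes: Dict[str, List[str]]) -> bool:
    """Any change means indexed content may carry stale extracted properties."""
    return bool(changes["added"] or changes["removed"])


# Hypothetical data: terms currently deployed vs. a fresh export from a PLM system
deployed_products = ["Contoso Widget", "Contoso Gadget"]
plm_export = ["Contoso Widget", "Contoso Gizmo"]

changes = dictionary_changes(deployed_products, plm_export)
print(changes, needs_reprocessing(changes))
```

Terms that show up as removed are good candidates for the regular review checkpoints, since retiring a term may be an exceptional case rather than a routine update.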

Best practice #3: Keep the index synchronized with content sources and dictionaries

Property Extraction Dictionaries

Search Index

Dictionary Data Sources

Enterprise Content Sources

Search synchronized with changes over time

As the language of your business and users’ needs evolve, so should your search solution

If not, the search experience and findability inevitably degrade over time – users’ trust will plunge too

Search management is not an IT responsibility; it belongs to the business

Best practice #4: Distinguish search management from systems management

Original implementation of the search solution

Actual search experience, if left unattended...

Job profile

• Skillset of a SharePoint administrator (not a programmer or systems engineer)

• Business perspective and focus

• Good ability with languages

• Attention to detail

Sample tasks

• Monitor search reports (daily/weekly)

• Run user polls and/or focus groups (quarterly)

• Process user feedback and questions

• Update dictionaries and manage keywords (as required)

• Support search-related projects

Staffing – depends on scale

• One person part-time, or

• A geographically distributed team


Case study #1: General Mills (Research & Development)

Business Problem

• Researchers forced to search each internal and external content source separately

• Low relevancy in existing search applications

• High effort in information discovery tasks

• Growing difficulty in establishing connections with experts as company grew worldwide

Approach & Solution

• FAST Search for SharePoint indexes all internal sources and federates external industry services

• Property extraction dictionaries extended to recognize product names cited in documents

• Deep refiners are used on extracted properties to drill down by products, companies and people

Benefits & Value

• Improved employee productivity with more relevant search results in a unified interface

• Greater information sharing and reuse across product areas & geographies

• Integrated people search eases social networking

• Proof point for wider search roll-out in enterprise

By using FAST Search Server 2010 for SharePoint, our researchers can refine their searches and find exactly what they are looking for. They spend more time innovating than looking for information.

– Michelle Check, R&D Systems Leader, General Mills

Link to full case study:

http://www.microsoft.com/caseStudies/Case_Study_Detail.aspx?casestudyid=4000007255

Case study #2: Mississippi Department of Transportation (MDOT)

Business Problem

• Poor access to a large, active collection of paper-based contracts and project documents

• Metadata managed in a separate DMS (database)

• Information silos stifle the sharing of data and collaboration

• Requirements to provide internal and public access

Approach & Solution

• FAST Search for SharePoint indexes images with iFilter-based OCR technology

• Pipeline extended with custom .NET code to merge metadata from database with indexed documents

• Custom refiners reflect language used in the business for navigating search results

Benefits & Value

• Unified self-service interface to locate information

• Ability to slice & dice results according to specific needs (dates, project, folder, route, district, etc.)

• Information search times cut from several hours or days to mere seconds or minutes

• Users have more time to focus on higher value tasks

We are literally reducing decision cycles from days to minutes for hundreds of overlapping decisions a day. With SharePoint Server 2010, we can make better spending decisions and enhance program performance without a very large investment.

– John Michael Simpson, CTO, MDOT

Link to full case study:

http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000007073

The challenges

• Explosive content growth puts information management and governance under pressure

• Multiple content silos with different search interfaces

• Poor metadata – missing, inconsistent, incorrect

The solution

• Content processing optimizes findability across disparate sources

• Property extraction generates metadata while indexing content

• Deep refiners expose metadata in search results helping users quickly zoom to the right information

The benefits

• Reduced costs through enterprise search consolidation and automated metadata enrichment

• Enhanced findability helps employees to get their job done faster

• Increased user adoption across the enterprise drives ROI

Ingredients for great enterprise search: The business value of FAST Search Server 2010 for SharePoint

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

microsoft.com / Enterprise Search
