Ontologizing the Ontolog Content

Protégé Workshop

Denise A. D. Bedford, Ph.D.
July 23, 2006

TaxoThesaurus Background

• Ontolog TaxoThesaurus Working Group aims to:

– Establish a framework for developing an ontology that focuses on the current and future content of the Ontolog community and supports a range of uses of Ontolog and Ontolog-referenced content by Ontolog members and non-members

– Provide a sustainable foundation for future variations in content, use and users - which is extensible without radical re-engineering going forward

– Provide a framework against which a basic set of functional architecture requirements can be defined – June discussion

– Provide a framework against which various semantic technologies might be positioned to support Ontolog - April and June discussions


TaxoThesaurus Background

• Provide a basis for a case study in collaborative practice domain ontology development and management

• Provide a comparison – along the way – of the various ontology reference models

• If the group wishes – along the way – provide the community with guidance in positioning semantic solutions vis a vis semantic problems


Goal is not to…

• Advocate one particular semantic approach over others because they all serve different purposes

• Provide a survey of or evaluate the individual technologies on the market today

• Suggest that any one person has a solution that works for everyone


Presentation Overview

• Description of the Bottom Up Strategy

• Status Update on the Ontolog Ontologizing Work

• Expert Review and Evaluation of Ontolog Outputs (Workshop begins with this step…)

• Discussion of Next Steps and Invitation to Participate


Part I. Description of the Bottom Up Approach

Workshop Overview

• Step 1. Describe the domain by brainstorming use, users and content to be ontologized

• Step 2. Identify the parameters of the ontology and describe their behavior

• Step 3. Identify semantic methods to support individual parameters

• Step 4. Take stock of architectural considerations

• Step 5. Generate values for the ontology parameters

• Step 6. Coordinate review and validation of ontology values

• Step 7. Operationalize ontology

Ontology Value Chain

• Describe Domain: Content, Use, Users
• Ontology Parameters: Definitions of Entities, Attributes, Classes, Relationships
• Semantic Methods: Strategy for Generating Values
• Architecture Issues: Application Requirements
• Generate Values: Ontology Raw Value Creation
• Expert Review: Ontology Refined Value Creation
• Operationalize Ontology: Working Ontology


Step 1. Describing the Domain

Describing the Domain

• Ontologies may involve semantic analysis, technologies, architecture, categorization, concept mapping and so forth

• Ultimately, though, they are about describing a domain so we should always begin with a general definition of the domain

• Domain may be a line of business or what we typically consider a subject domain

• Today we’ll work with the domain of ontologies - we will recognize very quickly that defining the domain is not easy and we will not agree on the definition until we have worked through several additional steps


Ontolog Domain

• We have focused the TaxoThesaurus project on the domain of ontologies since we have subject matter experts to work with and since we only have a few hours to walk through this exercise

• We will also take a bit of a shortcut and use the same boundaries to define our Ontology domain as are used to define the Ontolog Community of Practice

Describing the Domain

• As a starting point, let’s work with a framework that contains three essential components to help us better describe our domain:
– Domain content
– Users
– Use/processes

• These basic reference points should help us to identify several scenarios and to understand the basic functional requirements our ontology will have to satisfy


The Context for an Ontolog Ontology

[Diagram labels: Users; Use or Function; Information (Document); Context]


Users

• May seem like the easiest dimension to address – but we need to make sure we have the same goals for the Ontolog ontology

• Do we assume that only Ontolog active members will be served by the ontology?

• Or, do we support all members and the general public who might be interested in joining the community or who might find the wiki content a valuable resource for learning?

• Are we assuming only ontolog-sophisticates or do we include general managers, novices, general public interest?


User Community

Who | Domain Knowledge | Roles
Ontolog Member | Wiki | Wiki Manager
Ontolog Member/Non-Member | Ontology research & development | Researchers, discussants, presenters, novices
Ontolog Member/Non-Member | Computational linguistics | Researchers, discussants, presenters, novices
Ontolog Member/Non-Member | Standards development work | Participants, vendors, observers, implementors
Ontolog Member/Non-Member | Metadata | Creators, users, semantics developers, computational linguists
Ontolog Member/Non-Member | Taxonomies | Creators, designers, users, semantics developers, computational linguists
Ontolog Member/Non-Member | Information Architecture | Engineers, information scientists
Ontolog Member/Non-Member | Semantic Technologies | Developers, users, implementors, linguists, novices

Use and Context

• It is challenging for people who are so familiar with ontology development and semantic technologies to step back and think about how an ontology would actually support our use of the Ontolog content

• But, this is a critical first step – without understanding the use and context, we cannot establish a baseline ontology

• Without understanding use and context we will forever argue about which model works best, which tools work best and who should do what – actually, there is room for variation and negotiation here

• Following tables are the result of some brainstorming and observations from the Ontolog community itself


Possible Uses of Ontolog Content

Doing | What
Find | Person who knows something about an issue
Browse | Issues that Ontolog has discussed
Find | All people who participated in a discussion
Learn About | Reference models discussed by Ontolog
Get list of | Problems Ontolog identified that need attention
Browse | Collections by topic
Search | Future conference call topics


Possible Uses of Ontolog Content

Doing | What
Search | Next scheduled call
Search | Specific email message
Find | List of all members of Ontolog
Find | Specific Ontolog member
Find | Reference to ontology standards
Find | Book references
Find | Organizations working in this area


Possible Uses of Ontolog Content

Doing | What
Find | Upcoming conferences & participants
Generate | Knowledge map of who knows what in Ontologies
Generate | Map of the social networking in Ontolog
Publish | Review of a new book
Start | Discussion of a new topic
Annotate/summarize | Discussion thread
Others?? | Others??

Content

• In order to understand the content, the Taxo-Thesaurus Project Team ran an inventory of all the content published to or contained in the Wiki

• The Ontolog Use Case surfaced over 65,000 content objects (some of which were versions of the same object)

• The inventory gave us a better sense of the kinds of content available in the domain and an accurate picture of what was covered in the existing repository

• We discovered that not all of the kinds of content that belong to the broader domain of Ontologies are accessible in or from the Wiki, though.

• This includes people, organizations, as well as other kinds of static information content

• We need to add content to the domain


Sample Coast Content Inventory


First Cut at Ontolog Content

• Ontolog People profiles/pages
• Ontolog presentations
• Ontolog discussion threads
• Ontolog concepts
• Ontolog Activity Calendar
• Ontolog Conference call notes
• Ontolog Conference call agendas
• Ontolog Conference call minutes
• Ontolog Conference call transcripts
• Email messages
• Discussion threads/forums
• Professional Conference schedules & announcements
• Professional Conference representation
• Books on ontology topics
• Published articles on ontology topics
• Reviews of books on ontologies
• Ontology standards
• Professional organizations
• Research institutions
• Wiki search logs


Describing the Domain

• At the end of this Step we have a basic idea of what kinds of content the ontology will have to cover, the kinds of entities it will have to include, and the kinds of relationships and concepts that will be needed to support functionality

• We are now ready to begin to specify the parameters of the ontology


Step 2. Identify the Parameters of the Ontology


Definition of Ontology

• “Data model that represents a domain and is used to reason about the objects in that domain and the relationships between them. Ontologies are used in artificial intelligence, the semantic web, software engineering and information architecture as a form of knowledge representation about the world or some part of it. Ontologies generally describe:
– Entities
– Classes
– Attributes
– Relations”

(source – Wikipedia)

Ontology Architecture Begins to Emerge

[Diagram. Node labels: Content Entity; Definition; Content Elements; Content Model; Aggregation Levels; Content; Metadata Profile; Ontolog Topic Class Scheme; Authority Control – Member Names; Authority Control – Organizations; Thesaurus of Ontolog Concepts; Areas of Expertise; User; Use; Contextual Matrix & Sensing; Profile; Business Rule. Link labels: Has; Has values; Uses; Contains; Has relationship to; Has meaning in; Understood in.]


Entities in the Ontolog Domain Include…

• People
• Institutions
• Communities of Practice
• Journal articles
• Books
• Discussion threads
• Presentations
• Standards
• Project proposals
• Memoranda of Understanding
• Conference announcements
• Conference presentations
• Research grant program descriptions
• Research reports
• Conference call notes
• Conference call tapes
• … many others


Attributes Include...

• Ideally, we would model all of these entities at least at a high level

• The models of these entities would include:

– Attributes of entities as structured content (structured data)

– Content elements (semi- or unstructured content)

– Value added metadata

More Advanced Entity Models

• When we began describing content about ten years ago, we went to a more granular level

• We defined data models for our entities

• Ideally, you will also take the effort to the entity data model level

• Following is an example of a data model for a communique

• We also defined data models for people, institutions, countries, projects, many types of knowledge, for document types, communications (drawing on news schema), etc.

• Taking it to this level enables you to apply the ontology at a more granular level and to increase the goodness of your application

Content Data Model Example – Event, Communique


In order for Ontolog to support….

• Search

– We need to know the parameters users will search by (for who, what, where, when, how…)

– We need to understand the behavior and semantic challenges of those parameters (author names and variations, affiliations, facets of domains, dates, …)

• Knowledge mapping of Ontolog members

– We need to know who is a member of the Ontolog CoP
– We need to know general areas of expertise in order to describe the members’ knowledge consistently
– We need to know their names and variations of their names
– We need to know their affiliations (organizational names)

In order for Ontolog to support….

• Navigation/browse by novices and experts

– We need to know how to organize the content for easy access. By domain facet? By topic? By country?

– How to organize facets to facilitate expert and novice access.
– How to maintain the reference sources that support facets.

• Easy access at the concept level by managers and others who may not have technical expertise…

– What we discovered when we did the inventory was that 90%+ of the Ontolog content is technical in nature

– Our expectation that non-technical managers would use the content to understand the value of ontologies does not hold now

– We need to include more non-technical content and we need to bridge the technical/non-technical vocabulary


Understanding Semantic Behavior of Attributes

• My experience suggests that before we can successfully apply semantic technologies in an ontology context, we need to understand the behavior of the attributes

• There are many different kinds of semantic methods and it is important to match the right solution to the problem

• Let’s think about some of the semantic challenges we find in some typical attributes
– Person’s name
– Organization name
– Country name
– Class scheme
– Concepts


People Name Challenges

• People names vary in different ways

• Over time as names change with life events
– Denise Ann Dowding
– Denise Ann Dowding Bedford

• In their format depending on context
– D. Bedford
– D. A. D. Bedford
– Denise D. Bedford
– Denise A. Bedford
– Denise A. D. Bedford

• Common versus formal names
– Denny vs. Denise
– Raju vs. Rajendra vs. Natarajan

• Need to link all semantic equivalents in the ontology
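Linking name variants to one preferred form can be sketched as a small authority file. This is only an illustration: the record layout, the `normalize` rule and the choice of preferred form are assumptions, not the structure used in the Ontolog work.

```python
from typing import Optional

# Hypothetical authority record: one preferred form plus its known variants.
NAME_AUTHORITY = {
    "Denise A. D. Bedford": [
        "Denise Ann Dowding",
        "Denise Ann Dowding Bedford",
        "D. Bedford",
        "D. A. D. Bedford",
        "Denise D. Bedford",
        "Denise A. Bedford",
    ],
}


def normalize(name: str) -> str:
    """Lowercase, drop periods and collapse whitespace before comparing."""
    return " ".join(name.replace(".", " ").lower().split())


def build_index(authority: dict) -> dict:
    """Map every normalized form (preferred or variant) to the preferred form."""
    index = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            index[normalize(form)] = preferred
    return index


NAME_INDEX = build_index(NAME_AUTHORITY)


def resolve(name: str) -> Optional[str]:
    """Return the preferred form for a name, or None if it is not in the file."""
    return NAME_INDEX.get(normalize(name))
```

A lookup such as `resolve("D. A. D. Bedford")` returns the preferred form, while an unlisted name returns `None`, which signals that the authority file needs a new entry.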


Class Schemes & Classification Problems

• Class schemes
– Have inheritance structures which must be respected
– Classes may experience scope changes
– Classes may appear or be archived over time
– May be insufficiently comprehensive in coverage of the domain

• Classification
– Human classification tends to suffer from inconsistencies due to limited perspectives, variations in perception, and variations over time

• Classes need to be comprehensively represented across the domain and managed consistently over time

• Classification needs to be performed consistently


Geographic Names

• Variations in country names occur

• Over time as political context changes
– Armenia
– Soviet Socialist Republic of Armenia

• By perspective and tradition
– Madras or Chennai
– Mumbai or Bombay

• All variations need to be linked as equivalencies or they need to be linked as predecessor/successor forms in an authority controlled context
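The difference between an equivalent form and a predecessor form can be made explicit in the authority record itself. The sketch below is a minimal illustration; the `PlaceRecord` fields and relation labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PlaceRecord:
    preferred: str
    equivalents: List[str] = field(default_factory=list)   # same place, other tradition
    predecessors: List[str] = field(default_factory=list)  # earlier official forms


# Illustrative records built from the examples on this slide.
RECORDS = [
    PlaceRecord("Armenia", predecessors=["Soviet Socialist Republic of Armenia"]),
    PlaceRecord("Mumbai", equivalents=["Bombay"]),
]


def lookup(name: str, records: List[PlaceRecord] = RECORDS) -> Optional[Tuple[str, str]]:
    """Return (preferred form, relation) for a name, or None if it is unknown."""
    key = name.casefold()
    for rec in records:
        if key == rec.preferred.casefold():
            return rec.preferred, "preferred"
        if key in (e.casefold() for e in rec.equivalents):
            return rec.preferred, "equivalent"
        if key in (p.casefold() for p in rec.predecessors):
            return rec.preferred, "predecessor"
    return None
```

Keeping the relation type lets a search system treat "Bombay" as a plain synonym while still recording that the Soviet-era form is an earlier name.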


Concept Challenges

• Primary challenges with concepts are based on:

– Concept as a word unit – as defined in dictionaries or word compendiums (WordNet)
• Girls, education
• Sediment, transport

– Concept as a multiword unit – idea as identified in glossaries, thesauri
• Girls education
• Sediment transport

• True concepts are defined at the multiword level

• Need to be able to understand the linguistic nature of the language in order to discover concepts
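To make the word-versus-concept distinction concrete, here is a minimal sketch that looks for multiword concepts from a small concept list using longest-match-first lookup. It is a simplification: the `CONCEPTS` set is a hypothetical stand-in for a real thesaurus, and no grammatical analysis is attempted.

```python
import re

# Hypothetical thesaurus entries, stored as lowercase multiword units.
CONCEPTS = {"girls education", "sediment transport"}
MAX_WORDS = max(len(c.split()) for c in CONCEPTS)


def find_concepts(text: str) -> list:
    """Return thesaurus concepts found in text, preferring the longest match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in CONCEPTS:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return found
```

The words "education" and "girls" appearing separately in a sentence do not produce a hit; only the multiword unit does.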


Quick Taxonomy Primer

• Before we can begin to model and/or solve semantic problems programmatically, we need to understand the structure and behavior of taxonomies

• There are five types of taxonomies:
– Flat taxonomies (controlled lists)
– Hierarchical taxonomies (class schemes)
– Ring taxonomies (synonym, equivalencies)
– Network taxonomies (thesauri, semantic networks)
– Faceted taxonomies (aspects, metadata)


Flat Taxonomy Structure

Energy Environment Education Economics Transport Trade Labor Agriculture


Hierarchical Taxonomy

A hierarchical taxonomy is represented as a tree data structure in a database application. The tree data structure consists of nodes and links. In an RDBMS environment, the relationships become associations. In a hierarchical taxonomy, a node can have only one parent.
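The single-parent rule can be sketched as a mapping from each node to its one parent. The class below is illustrative only, and the sample class names under "Energy" are hypothetical.

```python
class HierarchicalTaxonomy:
    """Tree in which every node has at most one parent."""

    def __init__(self):
        self.parent = {}  # node -> its single parent (None for a root)

    def add(self, node, parent=None):
        if node in self.parent:
            # A second placement would give the node a second parent.
            raise ValueError(f"{node!r} already has a place in the tree")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[node] = parent

    def path(self, node):
        """Return the chain of classes from the root down to node."""
        chain = []
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain[::-1]


# Hypothetical class scheme fragment.
tax = HierarchicalTaxonomy()
tax.add("Energy")
tax.add("Renewable energy", "Energy")
tax.add("Solar power", "Renewable energy")
```

Because each node has exactly one parent, `tax.path("Solar power")` yields one unambiguous inheritance chain, which is what a class scheme relies on.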


Network Taxonomies

A network taxonomy is a plex data structure. Each node can have more than one parent. Any item in a plex structure can be linked to any other item. In plex structures, links can be meaningful & different.
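A plex structure can be sketched as a graph whose links carry a relationship type. The relation labels `broader` and `related` below are assumptions borrowed from common thesaurus practice, and the sample terms are only illustrative.

```python
from collections import defaultdict


class NetworkTaxonomy:
    """Graph of terms where each link has its own meaning."""

    def __init__(self):
        self.links = defaultdict(set)  # node -> {(relation, other node)}

    def link(self, source, relation, target):
        self.links[source].add((relation, target))

    def related(self, node, relation=None):
        """Targets linked from node, optionally limited to one relation type."""
        return sorted(t for r, t in self.links[node] if relation is None or r == relation)

    def parents(self, node):
        return self.related(node, "broader")


net = NetworkTaxonomy()
net.link("Girls education", "broader", "Education")
net.link("Girls education", "broader", "Gender")
net.link("Girls education", "related", "Poverty reduction")
```

Unlike the tree, a term here can have two parents ("Education" and "Gender"), and a "related" link stays distinct from a "broader" one.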


Ring Taxonomy

Poverty mitigation
Poverty alleviation
Poverty elimination
Poverty reduction
Poverty eradication
Poverty abatement
Poverty prevention
Poverty reducation

Rings can include all kinds of synonyms - true, misspellings, predecessors, abbreviations
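A ring can be sketched as a set of terms where any member expands to all the others, for example at query time. The function name `expand` is hypothetical; the ring contents are the terms from this slide, including the deliberate misspelling.

```python
# One synonym ring; "Poverty reducation" is a misspelling kept on purpose.
POVERTY_RING = {
    "Poverty mitigation",
    "Poverty alleviation",
    "Poverty elimination",
    "Poverty reduction",
    "Poverty eradication",
    "Poverty abatement",
    "Poverty prevention",
    "Poverty reducation",
}
RINGS = [POVERTY_RING]


def expand(term: str, rings=RINGS) -> list:
    """Return every term in the ring that contains term, or just the term itself."""
    key = term.casefold()
    for ring in rings:
        if key in {t.casefold() for t in ring}:
            return sorted(ring)
    return [term]
```

A search for the misspelled form then retrieves content indexed under any of the correct forms.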


Facet Taxonomies

A faceted taxonomy is represented as a star data structure. Each node in the star structure is linked to the center focus. Any node can be linked to other nodes in other stars. It appears simple, but becomes complex quickly.
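One way to picture the star is a content item at the center with one value per facet around it. The sketch below is an assumption-laden illustration: the facet names `content_type` and `topic` and the sample items are hypothetical.

```python
# Each item is the center of a star; its facets are the points.
ITEMS = [
    {"focus": "Call notes A", "facets": {"content_type": "Conference call notes", "topic": "Taxonomies"}},
    {"focus": "Presentation B", "facets": {"content_type": "Presentation", "topic": "Taxonomies"}},
    {"focus": "Presentation C", "facets": {"content_type": "Presentation", "topic": "Metadata"}},
]


def select(items, **wanted):
    """Return the focus of every item whose facets match all requested values."""
    return [
        item["focus"]
        for item in items
        if all(item["facets"].get(facet) == value for facet, value in wanted.items())
    ]
```

Items that share a facet value ("Taxonomies") are implicitly linked across stars, which is where the complexity starts to grow.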


Step 3. Architectural Considerations


Functional Architecture & Requirements

• The focus of the workshop is not to discuss the architecture to support an ontology

• Instead, we simply highlight this step to emphasize the importance of stopping at this point in the process to focus on how you will support use of the ontology

• This is where varying assumptions may cause a breakdown in agreements within groups

• Some may presume that an ontology will be applied on top of content dynamically

• Others may presume that the ontology will be embedded into a more formal enterprise architecture


Functional Requirements Begin to Emerge

• At this stage functional requirements and architecture issues begin to surface. In the WB context, we realized we needed:

– Metadata schema
– Different kinds of taxonomies (controlled lists, rings, hierarchies, concept networks)
– Semantic analysis tools to support metadata capture
– Metadata encoding options (XML, RDF, etc.)
– Metadata storage options (e.g. embedded in document, distinct database, etc.)
– Search system which supports attribute searching & which leverages reference sources
– Browse structure
– Reporting
– Data mining and clustering
– Other more sophisticated inference and reasoning options to support contextualization, business intelligence, and expert systems/inferencing engines


Enterprise Search Functional Architecture

[Architecture diagram. Recoverable component labels:]
• Source systems, each with metadata (MD): JOLIS, IRAMS, Global JOLIS, Image Bank, LMS, CMS, IRIS
• Metadata Extracts; Metadata Loads; Consolidated Metadata Store; Union Index
• Enhanced Common Data Stores (includes thesaurus support and taxonomies)
• MDR Tools & Utilities: Metadata Maintenance Utilities, Security Policy, Change Mgmt. Processes, Utilities, Interface Templates
• Parametric Indexes; Index Utilities; Teragram Metadata Capture
• Search tools: Search Interface (Simple and Fielded Search), Results Display & Manipulation, Query Manipulation Options, Query Processing Algorithms
• MetaModel Repository: Relational MetaModel, Business MetaModel, Logical MetaModel, Application MetaModel, Metadata Repository MetaModel (including transformation rules, reporting specs, loader programs, data standards, data rationalization)
• Content Aggregator; Recommender Engine; Content Syndication; Personalization Profiles; Social or Task Filtering; Threshold Filtering
• Vocabulary Support: Classification Schemes, Cross Language Searching


Step 4. Identify Semantic Methods to Generate Ontology Values

Reality of Ontology Values

• Ontologies are grounded on structures, definitions, relationships and VALUES

• Without VALUES you don’t have an ontology

• The problem is that generating values is very resource-intensive and no one has sufficient human resources to support this work

• Solution is to leverage semantic technologies to generate values for ontologies

• As we saw in Step 3, there are different kinds of semantic problems that require different kinds of solutions

• Challenge is finding the right semantic solution to fit the semantic problem


Ontolog Values

• Today we will share with you for your review and critique some programmatically generated values for entities, attributes, concepts and reference sources

• Before we do that, though, we’d like to describe how we used semantic technologies to generate the outputs

• Before we describe the technologies and how we used them, though, it might be important to distinguish two basic types of approaches


NLP Technologies – Two Approaches

• Over the past 50 years, there have been two competing strategies in NLP - statistical vs. semantic

• In the mid-1990s at the AAAI Stanford Spring Workshops it was agreed by the active practitioners that the statistical NLP approach had hit a rubber ceiling – there were no further productivity gains to be made from this approach

• About that time, the semantic approach showed practical gains – we have been combining the two approaches since the late 1990s

• Teragram supports both approaches but is a semantic technology at base – this is the best configuration and it provides the greatest flexibility.


Statistical NLP

• Statistical Approach uses statistical regression and Bayesian modeling methods to find patterns in words.

• This approach treats words as if they are ‘data’ – it breaks text down into single-word tokens and then tries to find similar tokens. There is no attempt to understand or detect meaning in the words – they are only characters/digits in strings.

• It then runs statistical analysis to find ‘co-occurring tokens’

• The problem with this approach is that it works only at the word or word fragment level and you never get to a higher level of understanding from this baseline.

• This approach helps you to learn that ‘girls’ and ‘education’ are related – but, we don’t need a statistical tool to tell us this – we already know this and can represent it as a concept (vs. a word)
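The token-level view described above can be illustrated with a bare co-occurrence count. This is a deliberate simplification: it shows only the "co-occurring tokens" step, not regression or Bayesian modeling, and the sample documents are made up for the example.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how many documents each pair of single-word tokens shares."""
    counts = Counter()
    for doc in docs:
        tokens = set(re.findall(r"[a-z]+", doc.lower()))
        # Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(tokens), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical mini-corpus.
DOCS = [
    "Girls education in rural areas",
    "Education for girls",
    "Sediment transport in rivers",
]
COUNTS = cooccurrence(DOCS)
```

The count tells us that "education" and "girls" appear together in two documents, but nothing in the output says that "girls education" is a concept; the tokens remain strings.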

Problem with Statistical NLP

• We experimented with several of these tools in the early 2000s – including Autonomy, Semio, Northern Lights Clustering

• We saw the following known effects --

– the statistical associations you generate are entirely dependent upon the frequency at which they occur in the training set

– Without a semantic base you cannot distinguish types of entities, attributes, concepts or relationships

– If the training set is not representative of your universe, your relationships will not be representative and you cannot generalize from the results

– If the universe crosses domains, then the words that have the greatest commonality (least meaning) have the greatest association value


Semantic NLP

• For years, people thought the semantic approach could not be achieved so they relied on statistical methods

• The reason they thought it would never be practical is that it took a long time to build the foundation – understanding human language is not a trivial exercise

• Building a semantic foundation involves:
– Developing grammatical and morphological rules – language by language
– Using parsers and Part of Speech (POS) taggers to semantically decompose text into semantic elements
– Building dictionaries or corpora for individual languages as fuel for the semantic foundation to run on
– Making it all work fast enough and in a resource efficient way to make it economically practical


Example of Semantic Analysis


Getting Semantic with Computational Linguistics

• Computational linguistics is an interdisciplinary field dealing with the logical modeling of natural language from a computational perspective

• Computational linguistics puts the semantic in natural language processing.

• Computational linguistics predates artificial intelligence - originated with efforts in the United States in the 1950s to have computers automatically translate texts in foreign languages into English, particularly Russian scientific journals.

• This work was finally brought to a practical level in the 1980s with the joint NASA-Russian Soyuz Space Station work. The first product we looked at in 1998 was NASA’s MAI toolset

• It has taken us 50 years to get where we are today – and Teragram provides us with some practical NLP capabilities.


How We Used the Semantic Technologies

• Teragram is a set of multilingual natural language processing (NLP) technologies that use the representation and meaning of text to distill relevant information from vast amounts of data.

• Teragram’s Natural Language Processing technologies include:
– Rules Based Concept Extraction (also called classifier)
– Grammar Based Concept Extraction
– Categorization
– Summarization
– Clustering
– Language detection

• The package consists of a developer’s client (TK240) and multiple servers to support the technologies

• We have taken this basic ‘technology toolkit’ and implemented it in a way that supports programmatic metadata capture and is consistent with good practice data quality and data management


Rule Based Concept Extraction

• What is it?
– Rule based concept or entity extraction is a simple pattern recognition technique which looks for and extracts named entities
– Entities can be anything – but you have to have a comprehensive list of the names of the entities you’re looking for

• How does it work?
– It is a simple pattern matching program which compares the list of entity names to what it finds in content
– Regular expressions are used to match sets of strings that follow a pattern but contain some variation
– List of entity names can be built from scratch or using existing sources – we try to use existing sources
– A rule-based concept extractor would be fueled by a list such as Working Paper Series Names, edition or version statement, Publisher’s names, etc.
– Generally, concept extraction works on a “match” or “no match” approach – it matches or it doesn’t
– Your list of entity names has to be pretty good
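The match/no-match behavior described above can be sketched with Python's standard `re` module. This is not the Teragram profile format; the entity lists, the edition pattern and the function name `extract` are all hypothetical stand-ins.

```python
import re

# Hypothetical authoritative lists of entity names, one list per entity type.
ENTITY_LISTS = {
    "series": ["Ontology Working Paper Series"],
    "publisher": ["Example University Press"],
}

# A regular expression covers names that follow a pattern but vary in detail.
PATTERNS = {
    "edition": re.compile(
        r"\b(?:\d+(?:st|nd|rd|th)|first|second|third|revised)\s+(?:edition|ed\.)",
        re.IGNORECASE,
    ),
}


def extract(text: str) -> list:
    """Return (entity type, matched text) pairs; a name matches or it doesn't."""
    hits = []
    for entity_type, names in ENTITY_LISTS.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
                hits.append((entity_type, name))
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((entity_type, match.group(0)))
    return hits
```

A publisher that is missing from the list is simply not found, which is why the quality of the list matters so much.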


Rule Based Concept Extraction

• How do we build it?
1. Create a comprehensive list of the names of the entities – most of the time these already exist, and there may be multiple copies
2. Review the list, study the patterns in the names, and prune the list
3. Apply regular expressions to simplify the patterns in the names
4. Build a Concept Profile
5. Run the concept profile against a test set of documents (not a training set because we build this from an authoritative list not through ‘discovery’)
6. Review the results and refine the profile

• State of Industry
– The industry is very advanced – this type of work has been under development and deployed for at least three decades now. It is a bit more reliable than grammatical extraction, but it takes more time to build.


Rules Based Concept Extraction Examples

• Loan #
• Credit #
• Report #
• Trust Fund #
• ISBN, ISSN
• Organization Name (companies, NGOs, IGOs, governmental organizations, etc.)
• Address
• Phone Numbers
• Social Security Numbers
• Library of Congress Class Number
• Document Object Identifier
• URLs
• ICSID Tribunal Number
• Edition or version statement
• Series Name
• Publisher Name

Let’s look at the Teragram TK240 profiles for Organization Names, Edition Statements, and ISBN


ISBN Concept Extraction Profile – Regular Expressions (RegEx)

Replace this slide with the ISBN screen – with the rules displayed

Concept based rules engine allows us to define patterns to capture other kinds of data

Use of concept extraction, regular expressions, and the rules engine to capture ISBNs.

Regular expressions match sets of strings by pattern, so we don’t need to list every exact ISBN we’re looking for.
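As a rough illustration of pattern-based capture, the Python sketch below finds labelled ISBNs with a regular expression and keeps only those whose check digit is valid. The pattern is an assumption written for this example; it is not the rule set shown in the TK240 ISBN profile.

```python
import re

# Assumed pattern: an "ISBN" label, then 10 or 13 digits with optional
# hyphens or spaces between them (the last ISBN-10 character may be X).
ISBN_CANDIDATE = re.compile(
    r"\bISBN(?:-1[03])?:?\s*((?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx])\b"
)

def is_valid_isbn(digits):
    """Check the ISBN-10 or ISBN-13 check digit of a separator-free string."""
    if len(digits) == 10 and digits[:9].isdigit():
        values = [int(c) for c in digits[:9]]
        values.append(10 if digits[9] in "Xx" else int(digits[9]))
        # ISBN-10: weights run from 10 down to 1; the sum must divide by 11.
        return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        # ISBN-13: weights alternate 1 and 3; the sum must divide by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
        return total % 10 == 0
    return False

def extract_isbns(text):
    """Return valid ISBNs found in text, with separators removed."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group(1))
        if is_valid_isbn(digits):
            found.append(digits.upper())
    return found
```

The check-digit step is an extra filter added here; a pattern alone would also accept digit strings that merely look like ISBNs.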


Classifier concept extraction allows us to look for exact string matches

List of entities matches exact strings. This requires an exhaustive list – but gives us extensive control. (It would be difficult to distinguish by pattern between IGOs and other NGOs.)


Another list of entities matches exact strings. In this case, though, we’re making this into an ‘authority control list’ – we’re matching multiple strings to the one approved output. (In this case, the AACR2-approved edition statement.)
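The authority control idea, many variant strings mapped to one approved output, can be sketched as a simple lookup. The variants and approved forms below are illustrative assumptions, not the contents of the actual edition statement profile.

```python
import re

# Hypothetical authority list: lowercase variant string -> approved output.
EDITION_AUTHORITY = {
    "second edition": "2nd ed.",
    "2nd edition": "2nd ed.",
    "2nd ed": "2nd ed.",
    "revised edition": "Rev. ed.",
    "rev. edition": "Rev. ed.",
}

def normalize_editions(text, authority=EDITION_AUTHORITY):
    """Return the approved form for each variant found in text, without repeats."""
    # Longest variants first, so "2nd edition" is tried before "2nd ed".
    variants = sorted(authority, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    approved = []
    for match in pattern.finditer(text):
        form = authority[match.group(0).lower()]
        if form not in approved:
            approved.append(form)
    return approved
```

Whatever wording the document uses, the metadata value that comes out is always one of the approved forms.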


Grammatical Concept Extractions

• What is it?
– A simple pattern matching algorithm which matches your specifications to the underlying grammatical entities
– For example, you could define a grammar that describes a proper noun for people’s names or for sentence fragments that look like titles

• How does it work?
– This is also a pattern matching program but it uses computational linguistics knowledge of a language in order to identify the entities to extract – if you don’t have an underlying semantic engine, you can’t do this type of extraction
– There is no authoritative list in this case – instead it uses parsers, part-of-speech tagging and grammatical code
– The semantic engine’s dictionary determines how well the extraction works – if you don’t have a good dictionary you won’t get good results
– There needs to be a distinct semantic engine for each language you’re working with


Grammatical Concept Extractions

• How do we build it?
– Model the type of grammatical entity we want to extract and use the grammar definitions to build a profile
– Test the profile on a set of test content to see how it behaves
– Refine the grammars
– Deploy the profile

• State of Industry
– It has taken decades to get the grammars for languages well defined
– There are not too many of these tools available on the market today but we are pushing to have more open source
– Teragram now has grammars and semantic engines for 30 different languages commercially available
– IFC has been working with ClearForest

• Let’s look at some examples of grammatical profiles – People’s Names, Noun Phrases, Verb Phrases, Book Titles


TK240 Grammars for People Names

Grammar concept extraction allows us to define concepts based on semantic language patterns.


Grammatical Concept Extraction

<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
<Source>
<Source_Type>file</Source_Type>
<Source_Name>W:/Concept Extraction/Media Monitoring Negative Training Set/ 001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
</Source>
<Profile_Name>PEOPLE_ORG</Profile_Name>
<keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
<keyword_count>7</keyword_count>
</Proper_Noun_Concept>

Proper Noun Profile for People Names uses grammars to find and extract the names of people referenced in the document.
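One way a downstream process might read extraction output of this shape is sketched below in Python. The element names come from the sample above; the function name `read_concepts` and the consistency check are assumptions for illustration, and the sample omits the XML declaration and the Source block for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample output above (declaration and Source omitted).
SAMPLE = """<Proper_Noun_Concept>
<Profile_Name>PEOPLE_ORG</Profile_Name>
<keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
<keyword_count>7</keyword_count>
</Proper_Noun_Concept>"""

def read_concepts(xml_text):
    """Parse one extraction result into a profile name and a list of names."""
    root = ET.fromstring(xml_text)
    raw = root.findtext("keywords", default="")
    # The keyword string ends with a trailing comma, so drop empty pieces.
    names = [piece.strip() for piece in raw.split(",") if piece.strip()]
    declared = int(root.findtext("keyword_count", default="0"))
    return {
        "profile": root.findtext("Profile_Name"),
        "names": names,
        "count_matches": declared == len(names),
    }
```

Comparing the declared count with the parsed list is a cheap data quality check before the values are loaded as metadata.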


Grammatical Concept Extraction – People Names Client testing mode


Rule-Based Categorization

• What is it?
– Categorization is the process of grouping things based on characteristics
– Categorization technologies classify documents into groups or collections of resources
– An object is assigned to a category or schema class because it is ‘like’ the other resources in some way
– Categories form part of a hierarchical structure when applied to such subjects as a taxonomy

• How does it work?
– Automated categorization is an ‘inferencing’ task – meaning that we have to tell the tools what makes up a category and then how to decide whether something fits that category or not
– We have to teach it to think like a human being
• When I see – access to phone lines, analog cellular systems, answer bid rate, answer seizure rate – I know this should be categorized as ‘telecommunications’
• We use domain vocabularies to create the category descriptions
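A minimal Python sketch of vocabulary-driven categorization follows. It assumes a category is assigned when at least `min_hits` of its concepts appear in the text; that threshold, the function name, and the second category's concepts are hypothetical, and real category descriptions would be far richer.

```python
import re

# Category descriptions as concept lists. The telecommunications concepts come
# from the example above; the wildlife concepts are made up for illustration.
CATEGORY_VOCABULARY = {
    "telecommunications": [
        "access to phone lines",
        "analog cellular systems",
        "answer bid rate",
        "answer seizure rate",
    ],
    "wildlife resources": ["wildlife habitat", "endangered species"],
}

def categorize(text, vocabulary=CATEGORY_VOCABULARY, min_hits=2):
    """Return categories whose concepts occur at least min_hits times, best first."""
    lowered = text.lower()
    scores = {}
    for category, concepts in vocabulary.items():
        # Count how many distinct concepts of this category appear as whole phrases.
        hits = sum(
            1 for concept in concepts
            if re.search(r"\b" + re.escape(concept) + r"\b", lowered)
        )
        if hits >= min_hits:
            scores[category] = hits
    return sorted(scores, key=lambda category: (-scores[category], category))
```

Requiring more than one concept hit is a design choice that trades some recall for fewer accidental assignments.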


Rule Based Categorization

• How do we build it?
1. Build the hierarchy of categories
a) Manually if you have a scheme in place and maintained by people
b) Programmatically if you need to discover what the scheme should be
2. Build a training set of content category by category – from all kinds of content
3. Describe each category in terms of its ‘ontology’ – in our case this means the concepts that describe it (generally between 1,000 and 10,000 concepts)
4. Filter the list to discover groups of concepts
5. The richer the definition, the better the categorization engine works
6. Test each category profile on the training set
7. Test the category profile on a larger set that is outside the domain
8. Insert the category profile into the profile for the larger hierarchy

• We built the Ontolog classification scheme using the programmatic approach – reference materials include the raw and refined lists, plus the ‘discovered classes’


Rule Based Categorization

• State of the Industry
– Only a handful of rule-based categorizers are on the market today
– Most of the existing technologies are dynamic clustering tools
– However, the market will probably grow in this area as the demand grows


Categorization Examples

• Let’s look at some working examples by going to the Teragram TK240 profiles
– Topics
– Countries
– Regions
– Sector
– Theme
– Disease Profiles

• Other categorization profiles we’re also working on…
– Business processes (characteristics of business processes)
– Sentiment ratings (positive media statements, negative media statements, etc.)
– Document types (by characteristics found in the documents)
– Security classification (by characteristics found in the documents)


Topic Hierarchy From Relationships across data classes

Build the rules at the lowest level of categorization


Subtopics

Domain concepts or controlled vocabulary


Topics Categorization Client Test


Automatically Generated XML Metadata


Automatically Generated Metadata


Automatically Generated XML Metadata for Business Function attribute

• Office memorandum on requesting CD’s clearance of the Board Package for NEPAL: Economic Reforms Technical Assistance (ERTA)


Clustering vs. Categorization

• Clustering Categorization


Clustering

• What is it?
– The use of statistical and data mining techniques to partition data into sets. Generally the partitioning is based on statistical co-occurrence of words, and their proximity to or distance from each other

• How does it work?
– Words that frequently occur close to one another are assigned to the same cluster
– Clusters can be defined at the set or the concept level – usually the latter
– Can work with a raw training set of text to discover and associate concepts or to suggest ‘buckets’ of concepts
– A few tools can work with a refined list of concepts to be clustered against a text corpus
– Please note the difference between clustering words in content and clustering domain concepts – major distinction
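The co-occurrence idea can be sketched in Python as follows. This is a deliberately small illustration that clusters a given list of concepts by how often they appear in the same document; it ignores word proximity and uses plain substring matching, and all names and the `min_docs` threshold are assumptions rather than how any commercial clustering engine works.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, concepts):
    """Count, for each pair of concepts, the documents that contain both."""
    counts = Counter()
    for document in documents:
        lowered = document.lower()
        present = sorted(c for c in concepts if c in lowered)
        counts.update(combinations(present, 2))
    return counts

def cluster_concepts(documents, concepts, min_docs=2):
    """Group lowercase concepts that co-occur in at least min_docs documents."""
    counts = cooccurrence_counts(documents, concepts)
    parent = {c: c for c in concepts}

    def find(concept):
        # Follow parent links to the representative of the concept's group.
        while parent[concept] != concept:
            parent[concept] = parent[parent[concept]]
            concept = parent[concept]
        return concept

    for (first, second), n in counts.items():
        if n >= min_docs:
            parent[find(first)] = find(second)

    groups = {}
    for concept in concepts:
        groups.setdefault(find(concept), []).append(concept)
    # Concepts that never co-occur often enough are left out of the result.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Because the groups are computed from the documents supplied, running the same concept list against a different text set can produce different clusters.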


Clustering

• How do we build it?
1. Define the list of concepts
2. Create the training set
3. Load the concepts into the clustering engine
4. Generate the concept clusters

• State of Industry
– Most of the commercial tools that call themselves ‘categorizers’ are actually clustering engines
– Generally, clustering doesn’t work at a high domain level for large sets of text
– Clustering engines can provide insights into concepts in a domain when used on a small set of documents
– All the engines are resource intensive, though, and the outputs are transitory – clusters live only in the cluster index
– If you change the text set, the cluster changes


Clustering Concepts

This is from the clustering output for 12.15.00 - Wildlife Resources.

‘Clusters’ of concepts between line breaks are terms from the Wildlife Resources controlled vocabulary found co-occurring in the same training document. This highlights often subtle relationships.


Clustering Words in Content

Clusters of words based on occurrences in the content


Summarization

• What is it?
– Rule-driven pattern matching and sentence extraction programs
– Important to distinguish summarization technologies from some information extraction technologies – many on the market extract ‘fragments’ of sentences – what Google does when it presents a search result to you
– Will generate document surrogates, point of view summaries, HTML metatag Description, and ‘gist’ or ‘synopsis’ for search indexing
– Results are sufficient for ‘gisting’ for HTML metatags, as surrogates for full text document indexing, or as summaries to display in search results to give the user a sense of the content

• How does it work?
– Uses rules and conditions for selecting sentences
– Enables us to define how many sentences to select
– Allows us to tell it the concepts to use to select sentences
– Allows us to determine where in the sentence the concepts might occur
– Allows us to exclude sentences from being selected
– We can write multiple sets of rules for different kinds of content


Summarization

• How do we build it?
1. Analyze the content to be summarized to understand the type of speech and writing used – IRIS is different from Publications is different from News stories
2. Identify the key concepts that should trigger a sentence extraction
3. Identify where in the sentence these concepts are likely to occur
4. Identify the concepts that should be avoided
5. Convert concepts and conditions to a rule format
6. Load the rule file onto the summarization server
7. Test the rules against test set of content and refine until ‘done’
8. Launch the summarization engine and call the rule file

• State of Industry
– Most tools are either readers or extractors. Reader method uses clustering & weighting to promote sentence fragments. Extractor method uses internal format representation, word & sentence weighting
– What has been missing from the Extractors in most commercial products is the capability to specify the concepts and the rules. Teragram is the only product we found to support this.


Summarization Rules

Code | Where it would appear in the sentence | Inclusion | Syntax
5 | anywhere in the sentence | It is likely not to be included | copyright/2004,5
9 | anywhere in the sentence | Definitely not included | for/example,9
7 | anywhere in the sentence | Definitely to be included | got/the/top/grade,7
10 | anywhere in the sentence | It is likely to be included | pull/off/that/coup,10
2 | anywhere in the sentence, followed by the second | It is likely to be included | evidence,2:collected
1 | beginning of the sentence | It is likely to be included | we/report,1
6 | beginning of the sentence | Definitely to be included | reporting/on,6
8 | beginning of the sentence | Definitely not included | copyright/reserved,8
3 | beginning of the sentence; only if the preceding sentence qualifies | It is likely to be included | however,3
4 | beginning of the sentence; only if the preceding sentence qualifies | Definitely to be included | the/former,4
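To make the rule codes listed above concrete, here is a small Python sketch that parses rules in the `phrase/words,code` syntax and uses them to select sentences. The code meanings are transcribed from the table; the numeric weights, the reading of "followed by the second" as "the second concept occurs later in the same sentence", and all function names are assumptions, not the behavior of the Teragram summarization server.

```python
import re

# code -> (position, effect, needs the preceding sentence to qualify)
CODES = {
    1: ("start", "likely", False),
    2: ("any", "likely", False),
    3: ("start", "likely", True),
    4: ("start", "definite", True),
    5: ("any", "likely_not", False),
    6: ("start", "definite", False),
    7: ("any", "definite", False),
    8: ("start", "never", False),
    9: ("any", "never", False),
    10: ("any", "likely", False),
}
WEIGHTS = {"likely": 1, "definite": 2, "likely_not": -1}  # assumed weights

def parse_rule(rule):
    """Split 'evidence,2:collected' into ('evidence', 2, 'collected')."""
    phrase, _, tail = rule.rpartition(",")
    code, _, follower = tail.partition(":")
    return phrase.replace("/", " ").lower(), int(code), follower.lower() or None

def rule_matches(sentence, phrase, code, follower):
    """Check whether the phrase occurs where the rule's code requires it."""
    text = " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))
    if CODES[code][0] == "start":
        return text == phrase or text.startswith(phrase + " ")
    padded = " " + text + " "
    index = padded.find(" " + phrase + " ")
    if index < 0:
        return False
    if follower is None:
        return True
    # The second concept must occur somewhere after the first one.
    return (" " + follower + " ") in padded[index + len(phrase) + 1:]

def summarize(sentences, rules):
    """Return the sentences that the rules select, in document order."""
    parsed = [parse_rule(rule) for rule in rules]
    selected, previous_selected = [], False
    for sentence in sentences:
        score, vetoed = 0, False
        for phrase, code, follower in parsed:
            _, effect, needs_previous = CODES[code]
            if needs_previous and not previous_selected:
                continue
            if not rule_matches(sentence, phrase, code, follower):
                continue
            if effect == "never":
                vetoed = True
            else:
                score += WEIGHTS[effect]
        previous_selected = not vetoed and score > 0
        if previous_selected:
            selected.append(sentence)
    return selected
```

In this sketch a "definitely not included" rule always wins, while the other rules add to or subtract from a sentence's score.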


Automatically Generated Gist

• PID Bosnia-Herzegovina Private Sector Credit Project

• Rules
– agreed/to,10
– with/the/objective,10
– objective,2:project
– proposed,2:project
– assist/in,10

• Gist


Step 5. Generate Values for the Ontolog Ontology


Sample Dimensions of Ontolog Ontology

• Names of organizations and companies (Rule based concept extraction)

• Names of people (Grammar based concept extraction)

• Countries (Rule based categorization)

• Ontology facets or subdomains (Grammar based concept extraction + rule based categorization) – Attachment #1

• Domain Vocabulary/Concept Lists (Grammar based concept extraction) – Attachment #2


Step 6. Review and Validation of Ontology Values


Expert Review of Facets

• Are all of the core facets of ontologies included in the list? If not, what is missing?

• We have identified some facets as related but not essential aspects of ontologies. Have we characterized these correctly? If not, what should be changed?

• What is included in the list that should not be? This includes both core and related facets.

• It is generally a good idea to try to limit facets to no more than 30 (what a human mind can retain in short term memory)


Expert Review of Concept Lists

1. If you were talking about ontology with an expert, are all of the concepts you would use included in the domain concept list? If not, what is missing?

2. Are there a few concepts missing, or is there a larger subdomain or knowledge area that is missing?

3. What is in the list that is core to ontologies? What is only related to ontologies?

4. If you were looking for information about ontologies – from an expert point of view – would you use any of these concepts to search? Which ones are missing? What shouldn’t be in the list?

5. If you were looking for information about ontologies from a novice’s point of view – what is missing from the list of concepts? What shouldn’t be included?


Step 7. Operationalizing the Ontology