Introduction to the Semantic Web
The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. Semantic Web architecture and applications are the next generation in information architecture. The word semantic stands for "the meaning of"; the Semantic Web is therefore a Web with meaning. The idea is to have data on the web defined and linked in such a way that it can be used by machines for various applications, enabling computers to find knowledge that is distributed throughout the web. The features of the Semantic Web are:
Information flexibility
The ability of pieces of information to relate to each other
First Generation - Keywords
Keyword technologies were originally used in IBM's free-text retrieval systems in the late 1960s. These tools are based on a simple scan of a text document to find a keyword or the root stem of a keyword. This approach can find keywords in a document, and can list and rank the documents containing them. But these tools have no ability to extract the meaning of a word or root stem, and no ability to understand the meaning of a sentence.
Advanced Search
Most keyword systems now include some form of Boolean logic (AND and OR functions) to narrow searches. This is often called advanced search. But using Boolean logic to exclude documents from a search is not advanced; it is an arbitrary means of shrinking the source database so that fewer documents are retrieved. This kind of advanced search significantly increases false negatives by missing many relevant source documents.
Examples:
The most common examples of keyword tools are website search tools and the Find function (Ctrl+F) in Microsoft Office applications.
Second Generation - Statistical Forecasting
Statistical forecasting first finds keywords and then calculates the frequency and distance of those keywords. Statistical forecasting tools now include many techniques for predictive forecasting, most often using inference theory. The frequency and distribution of words has some general value in understanding content, but these tools cannot understand the meaning of words or sentences, or provide context. They are still limited by keyword constraints, and can only infer simplistic meaning from the frequency and distribution of words.
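As a toy illustration of this approach, the following sketch (plain Python with an invented three-sentence text; illustrative only, not from the material above) computes a keyword's frequency and the minimum distance between two keywords:

# Toy sketch of second-generation statistical processing: find keywords,
# then compute their frequency and the distance between their occurrences.
# Real tools add inference theory and predictive models on top of this.
text = "dog bites man. man bites dog. the dog ran."
words = [w.strip(".").lower() for w in text.split()]

def frequency(term):
    return words.count(term)

def min_distance(term_a, term_b):
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return min(abs(a - b) for a in positions_a for b in positions_b)

print(frequency("dog"))              # 3
print(min_distance("dog", "bites"))  # 1: the two terms appear next to each other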
Applications:
Statistical forecasting tools are appropriate for performing simple document searches where the desired output is a list of documents containing specific words, which must then be read, classified, and summarized manually by end users. These tools are not capable of understanding the meaning, context, or relationships of documents.
Problems:
The most common problems with statistical forecasting tools are:
a) keyword limitations of false positives and false negatives; b) misunderstanding the meaning of words and sentences ("man bites dog" is treated the same as "dog bites man"); c) lack of context.
Examples:
The most common statistical forecasting tool is Google, along with many other tools that use inference theory and similar analytical and predictive algorithms.
Third Generation - Natural Language Processing
Natural language processors focus on the structure of language. They recognize that certain words in each sentence (nouns and verbs) play a different role (subject-verb-object) than others (adjectives, adverbs, articles). This understanding of grammar increases the understanding of keywords and their relationships ("man bites dog" is different from "dog bites man"). But these tools cannot extract the meaning of the words or their logical relationships beyond basic grammar, and they cannot perform any information summary, analysis, or integration functions.
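The difference that grammar roles make can be shown with a deliberately naive Python sketch (the parse_svo helper below is hypothetical and only handles three-word sentences):

# Naive sketch: assigning subject-verb-object roles distinguishes sentences
# that keyword tools treat as identical.
def parse_svo(sentence):
    subject, verb, obj = sentence.split()   # assumes a three-word sentence
    return {"subject": subject, "verb": verb, "object": obj}

print(parse_svo("man bites dog"))   # {'subject': 'man', 'verb': 'bites', 'object': 'dog'}
print(parse_svo("dog bites man"))   # different roles, therefore a different meaning
# Keywords alone cannot tell the two sentences apart:
print(sorted("man bites dog".split()) == sorted("dog bites man".split()))   # True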
Applications:
Natural language tools are appropriate for linguistic research and word-for-word translation applications where the desired output is a linguistic definition or a translation. They are not capable of understanding the meaning or context of sentences in documents, or of integrating information within a database.
Problems:
The most common problems with linguistic tools are: a) keyword limitations of false positives and false negatives; b) misunderstanding of context (does "I like Java" refer to an island in Indonesia, a computer programming language, or coffee?). Without understanding the broader context, a linguistic tool has only a dictionary definition of Java and does not know which Java is relevant or what other data relate to a specific Java concept.
Examples:
The most common natural language tools are translator programs, which use dictionary look-up tables and language-specific grammar rules to convert source languages into target languages.
Fourth Generation - Semantic Web Architecture and Applications
Semantic Web architecture and applications are a dramatic departure from earlier database and application generations. Semantic processing includes the earlier statistical and natural language techniques, and enhances them with semantic processing tools. First, Semantic Web architecture is the automated conversion and storage of unstructured text sources in a Semantic Web database. Second, Semantic Web applications automatically extract and process the concepts and context in the database with a range of highly flexible tools.
a. Architecture; not only Application
The Semantic Web is a complete database architecture, not only an application program. Semantic Web architecture combines a two-step process. First, a Semantic Web database is created from unstructured text documents. Then, Semantic Web applications run on the Semantic Web database, not on the original source documents.
The Semantic Web architecture is created by first converting
text files to XML and then analyzing these with a semantic
processor. This process understands the meaning of the words and
grammar of the sentence, and also the semantic relationships of the
context. These meanings and relationships are then stored in a
Semantic web database. The Semantic Web is similar to the schematic
logic of an electronic device or the DNA of a living organism. It
contains all of the logical content AND context of the original
source. And, it links each word and concept back to the original
document.
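A minimal sketch of this two-step process, assuming the open-source rdflib library (pip install rdflib); the extract_concepts function is a deliberately trivial stand-in for a real semantic processor, and the ex:/doc: vocabularies are hypothetical:

# Step 1: convert unstructured documents into a Semantic Web database,
# linking each extracted concept back to its source document.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")        # hypothetical concept vocabulary
DOC = Namespace("http://example.org/doc/")   # hypothetical document URIs

def extract_concepts(text):
    # Placeholder for the semantic processor: treats capitalized words as concepts.
    return {w.strip(".,") for w in text.split() if w[:1].isupper()}

def build_semantic_db(documents):
    g = Graph()
    g.bind("ex", EX)
    for doc_id, text in documents.items():
        for concept in extract_concepts(text):
            g.add((EX[concept], RDFS.label, Literal(concept)))
            g.add((EX[concept], EX.mentionedIn, DOC[doc_id]))   # link back to the source
    return g

# Step 2: applications run on the semantic database, not on the original documents.
docs = {"d1": "Ora wrote the page about Java.",
        "d2": "Java is an island in Indonesia."}
db = build_semantic_db(docs)
for concept, _, doc in db.triples((None, EX.mentionedIn, None)):
    print(concept, "appears in", doc)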
Semantic Web applications directly access the logical
relationships in the Semantic Web database. Semantic web
applications can efficiently and accurately search, retrieve,
summarize, analyze and report discrete concepts or entire documents
from huge databases.
A search for Java links directly to the three Semantic Web
logical clusters for Java: (island in Indonesia, a computer
programming language, and coffee). The processor can now query the
user for which Java, and then expand the search to all other
concepts and documents related to the specific Java.
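A rough sketch of that disambiguation step, again assuming rdflib; the query is written in SPARQL (the current W3C query language, used here instead of the languages discussed later), and the three Java resources and the ex: vocabulary are purely illustrative:

# Three distinct concepts that all carry the label "Java".
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)
for uri, cls in [(EX.JavaIsland, EX.Island),
                 (EX.JavaLanguage, EX.ProgrammingLanguage),
                 (EX.JavaCoffee, EX.Beverage)]:
    g.add((uri, RDF.type, cls))
    g.add((uri, RDFS.label, Literal("Java")))

# After the user (or an agent) chooses a cluster, the search is restricted to it.
query = """
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?concept WHERE {
        ?concept rdfs:label "Java" .
        ?concept a ex:ProgrammingLanguage .
    }"""
for row in g.query(query):
    print(row.concept)   # -> http://example.org/JavaLanguage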
b. Structured and Unstructured Data
Semantic Web architecture and applications handle both
structured and unstructured data. Structured data is stored in
relational databases with static classification systems, and also
in discrete documents. These databases and documents can be
processed and converted to Semantic Web databases, and then
processed with unstructured data.
Much of the data we read, produce, and share is now unstructured: emails, reports, presentations, media content, and web pages. These documents are stored in many different formats: plain text, email files, Microsoft Word, spreadsheet and presentation files, Lotus Notes, Adobe PDF, and HTML. It is difficult, expensive, slow, and inaccurate to attempt to classify and store these in a structured database. All of these sources can be automatically converted to a common Semantic Web database and integrated into one common information source.
c. Dynamic and Automatic; not Static and Manual
Semantic Web database architecture is dynamic and automated.
Each new document which is analyzed, extracted and stored in the
Semantic Web expands the logical relationships in all earlier
documents. These expanding logical relationships increase the
understanding of content and context in each document, and the
entire database. The Semantic Web conversion process is automated: no human action is required to maintain a taxonomy, metadata tagging, or classification. The semantic database is constantly updated and becomes more accurate.
Semantic Web architecture is different from relational database systems. Relational databases are manual and static because they are based on a manual process for maintaining a taxonomy, metadata tagging, and document classification in static file structures. Documents are manually captured, read, tagged, classified, and stored in a relational database only once, and are not updated. More importantly, the increase in new documents and information in a relational database does not make the database more intelligent about the concepts, relationships, or documents.
d. From Machine Readable to Machine Understandable
Semantic Web architecture and applications support both human
and machine intelligence systems. Humans can use Semantic Web
applications on a manual basis, and improve the efficiency of
search, summary, analysis, and reporting tasks. Machines can also use Semantic Web applications to perform tasks that humans cannot do, because of the cost, speed, accuracy, complexity, and scale of the tasks.
e. Synthetic vs Artificial Intelligence:
Semantic Web technology is NOT Artificial Intelligence. AI was a
mythical marketing goal to create thinking machines. The Semantic
Web supports a much more limited and realistic goal. This is
Synthetic Intelligence. The concepts and relationships stored in
the Semantic Web database are synthesized, or brought together and
integrated, to automatically create a new summary, analysis,
report, email, alert; or launch another machine application. The
goal of Synthetic Intelligence information systems is bringing
together all information sources and user knowledge, and
synthesizing these in global networks.
Semantic Web Building Blocks
(1) URI
A URI is simply a Web identifier, like the strings starting with "http" or "ftp" that are often seen on the World Wide Web. Anyone can create a URI. Every data object and every data schema/model in the Semantic Web must have a unique URI. A Uniform Resource Locator (URL) is a URI that, in addition to identifying a resource, provides a means of acting upon or obtaining a representation of that resource by describing its primary access mechanism or network location.
(2) RDF
RDF stands for Resource Description Framework:
RDF is a framework for describing resources on the web
RDF is designed to be read and understood by computers
RDF is not designed for being displayed to people
RDF is written in XML
RDF is a part of the W3C's Semantic Web Activity
RDF is a W3C Recommendation
RDF extends the linking structure of the Web to use URIs to name
the relationship between things as well as the two ends of the
link. The underlying structure of any RDF document is a collection
of triples. This collection of triples is usually called the RDF
graph. Each triple states a relationship (also called an edge or property) between two nodes (also called resources) in the graph. This abstract data model is independent of any concrete serialization syntax; therefore query languages usually do not provide ways to query serialization-specific aspects, such as the order of serialization.
This simple model allows structured and semi-structured data to be mixed, exposed, and shared across different applications. RDF provides a general, flexible method to
decompose any knowledge into small pieces, called triples, with
some rules about the semantics (meaning) of those pieces. By using
XML, RDF information can easily be exchanged between different
types of computers using different types of operating systems and
application languages.
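A small sketch of the triple model, assuming rdflib and a hypothetical ex: vocabulary; the final line serializes the graph as RDF/XML so that it can be exchanged between different systems:

# Each statement is a triple: (subject node, property/edge, object node).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)
g.add((EX.page, EX.author, EX.Ora))
g.add((EX.Ora, EX.name, Literal("Ora")))
g.add((EX.Ora, EX.memberOf, EX.W3C))

# Because RDF can be written in XML, the whole graph can be exchanged
# between different computers, operating systems and applications.
print(g.serialize(format="xml"))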
Support for XML schema data types.
XML data types can be used to represent data values in RDF. XML
Schema also provides an extensibility framework suitable for
defining new data types for use in RDF. Data types should therefore
be supported in an RDF query language.
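For example, under the same rdflib assumption, a date and an integer can be stored as typed RDF literals using XML Schema datatypes:

# Typed literals keep their XML Schema datatype, so a query engine can
# compare them as dates and numbers rather than as plain strings.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.page, EX.created, Literal("2004-02-10", datatype=XSD.date)))
g.add((EX.page, EX.pageCount, Literal(42, datatype=XSD.integer)))
print(g.serialize(format="turtle"))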
Support for making statements about resources. In general, it is not assumed that complete information about any resource is available to the RDF query. A query language should be aware of this and should tolerate incomplete or contradictory information.
(3) RDFS
RDF Schema extends RDF and is a vocabulary for describing properties and classes of RDF-based resources, with semantics for generalized hierarchies of such properties and classes. Classes in RDF Schema are much like classes in object-oriented programming languages. This allows resources to be defined as instances of classes, and as subclasses of other classes. RDFS is used to define relations between resources and to organize them into a hierarchy.
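A minimal RDF Schema sketch (rdflib assumed, ex: vocabulary hypothetical) showing classes, a subclass hierarchy, a property description, and a resource defined as an instance of a class:

# Classes and a subclass hierarchy in RDF Schema.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)
g.add((EX.Document, RDF.type, RDFS.Class))
g.add((EX.WebPage, RDF.type, RDFS.Class))
g.add((EX.WebPage, RDFS.subClassOf, EX.Document))   # generalized hierarchy
g.add((EX.author, RDFS.domain, EX.Document))        # property described with RDFS
g.add((EX.homepage, RDF.type, EX.WebPage))          # a resource as an instance of a class
print(g.serialize(format="turtle"))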
(4) OWL
OWL stands for Web Ontology Language. OWL is a language for processing web information and is built on top of RDF. OWL was designed to be interpreted by computers, not to be read by people. OWL is written in XML and has three sublanguages (OWL Lite, OWL DL, and OWL Full).
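A small OWL sketch under the same assumptions (rdflib, hypothetical ex: vocabulary); owl:inverseOf is used simply as an example of a construct that OWL adds on top of RDF and RDFS:

# OWL classes and properties layered on top of RDF.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Author, RDF.type, OWL.Class))
g.add((EX.Author, RDFS.subClassOf, EX.Person))
g.add((EX.wrote, RDF.type, OWL.ObjectProperty))
g.add((EX.writtenBy, RDF.type, OWL.ObjectProperty))
g.add((EX.wrote, OWL.inverseOf, EX.writtenBy))   # an OWL-specific construct
print(g.serialize(format="xml"))                 # OWL is written in XML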
RDF Query Languages
Query Language Properties
Expressiveness
Expressiveness indicates how powerful queries can be formulated
in a given language. Typically, a language should at least provide
the means offered by relational algebra, i.e. be relationally
complete. Usually, expressiveness is restricted to maintain other
properties such as safety and to allow an efficient (and
optimizable) execution of queries.
Closure
The closure property requires that the results of an operation are again elements of the data model. This means that if a query language operates on the graph data model, the query results would again have to be graphs.
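The closure property can be illustrated with SPARQL under the rdflib assumption (data and vocabulary hypothetical): a CONSTRUCT query returns a set of triples and therefore stays inside the graph data model, while a SELECT query returns a table of variable bindings and does not:

# Closure: a CONSTRUCT result is again a graph; a SELECT result is a table.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.page, EX.author, EX.Ora))

construct = g.query("""
    PREFIX ex: <http://example.org/>
    CONSTRUCT { ?doc ex:writtenBy ?who } WHERE { ?doc ex:author ?who }""")
closed_result = Graph()
for triple in construct:        # the result is a set of triples, i.e. a graph
    closed_result.add(triple)

select = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?doc ?who WHERE { ?doc ex:author ?who }""")
for row in select:              # the result is a table of bindings, not a graph
    print(row.doc, row.who)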
Adequacy
A query language is called adequate if it uses all concepts of the underlying data model. This property therefore complements the closure property: for closure, a query result must not be outside the data model; for adequacy, the entire data model needs to be exploited.
Orthogonality
The orthogonality of a query language requires that all operations may be used independently of the usage context.
Safety
A query language is considered safe if every query that is syntactically correct returns a finite set of results (on a finite data set). Typical concepts that cause query languages to be unsafe are recursion, negation, and built-in functions.
Some RDF query languages are:
RQL
RQL is a typed language following a functional approach, which
supports generalized path expressions featuring variables on both
nodes and edges of the RDF graph. RQL relies on a formal graph
model that captures the RDF modeling primitives and permits the
interpretation of superimposed resource descriptions by means of
one or more schemas.
SeRQL
SeRQL (Sesame RDF Query Language) is a querying and transformation language loosely based on several existing languages, most notably RQL and RDQL. Its primary design goals are the unification of best practices from existing query languages and the delivery of a lightweight yet expressive query language for RDF that addresses practical concerns.
RDQL
The syntax of RDQL follows a SQL-like select pattern, where the from clause is omitted. For example, select ?p where (?p, <rdfs:label>, "foo") collects all resources with the label foo in the free variable p. The select clause at the beginning of the query projects the variables. Namespace abbreviations can be defined in a query via a separate using clause. RDF Schema information is not interpreted.
Since the output is a table of variables and possible bindings, RDQL does not fulfill the closure and orthogonality properties. RDQL is safe and offers preliminary support for datatypes.
How is the RDF model different from the XML model?
RDF defines a data model based on triples: Object, Property, and Value, for example triple(author, page, Ora), meaning that the author of the page is Ora.
The RDF representation of this triple is a small graph: a node for the page and a node for Ora, connected by an author arc. The same statement can be written in XML in several different ways, for example by nesting Ora inside a page element, or by placing Ora inside an element that carries an href="page" attribute. These are all perfectly good XML documents, and to a person reading them they mean the same thing. To a machine parsing them, however, they produce different XML trees. Each XML encoding corresponds to a tree of elements and attributes, which can be drawn more concisely by writing the name of each element inside its node. Many different XML trees can therefore encode the same single RDF statement.
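The point can be demonstrated with a short sketch, assuming rdflib: the two RDF/XML documents below use different XML structures (a nested property element versus a property attribute), yet they parse to identical RDF graphs:

# Two different XML trees, one RDF graph.
from rdflib import Graph
from rdflib.compare import isomorphic

nested = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/page">
    <ex:author>Ora</ex:author>
  </rdf:Description>
</rdf:RDF>"""

attribute = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/page" ex:author="Ora"/>
</rdf:RDF>"""

g1 = Graph().parse(data=nested, format="xml")
g2 = Graph().parse(data=attribute, format="xml")
print(isomorphic(g1, g2))   # True: both contain the single triple (page, author, "Ora")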
Ontologies
Ontologies will play a major role in supporting information
exchange processes in various areas. Ontologies were developed in
Artificial Intelligence to facilitate knowledge sharing and reuse.
Ontologies are also becoming widespread in fields such as intelligent
information integration, cooperative information systems,
information retrieval, electronic commerce, and knowledge
management. The reason ontologies are becoming so popular is in
large part due to what they promise: a shared and common
understanding of some domain that can be communicated between
people and application systems. Because ontologies aim at consensual domain knowledge, their development is often a cooperative process involving different people, possibly at different locations.
Components of an Ontology
A computational ontology consists of a number of different
components, such as Classes, Individuals and Relations.
Concept
Concepts, also called Classes, Types, or Universals, are a core component of most ontologies. A Concept represents a group of different individuals that share common characteristics, which may be more or less specific. Concepts also share relationships with each other; these describe the way individuals of one Concept relate to the individuals of another.
Individual
Individuals, also known as instances or particulars, are the base unit of an ontology; they are the things that the ontology describes or potentially could describe. Individuals may model concrete objects such as people, machines, or proteins; they may also model more abstract objects such as this article, a person's job, or a function.
Relation
Relations in an ontology describe the way in which individuals relate to each other. Relations can normally be expressed directly between individuals or between Concepts.
Ontology Applications:
Natural Language Applications
Knowledge Management
Enterprise Application Integration
E-Commerce
Databases and Information Retrieval
Knowledge Management
Knowledge Management is concerned with acquiring, maintaining,
and accessing knowledge of an organization. It aims to exploit an
organisation's intellectual assets for greater productivity, new
value, and increased competitiveness. Knowledge management systems
have severe weaknesses:
Searching for information: existing keyword-based search retrieves irrelevant information that uses a certain word in a different context, and it may miss information where different words are used for the desired content.
Extracting information: human browsing and reading is currently required to extract relevant information from information sources, as automatic agents lack the common-sense knowledge required to extract such information from textual representations, and they fail to integrate information spread over different sources.
Maintaining information: maintaining weakly structured text sources is a difficult and time-consuming activity when such sources become large. Keeping such collections consistent, correct, and up-to-date requires a mechanized representation of semantics and constraints that helps to detect anomalies.
Automatic document generation: adaptive web sites that enable dynamic reconfiguration according to user profiles or other relevant aspects would be very useful. The generation of semi-structured information presentations from semi-structured data requires a machine-accessible representation of the semantics of these information sources.
(Figure: from the current Web to the Semantic Web. The current Web, built on URI, HTML, and HTTP, is largely static, serves roughly 500 million users and more than 3 billion pages, and has serious problems in finding, extracting, representing, interpreting, and maintaining information. The Semantic Web adds RDF, RDFS, and OWL on top of URI, HTML, and HTTP to address these problems.)