Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs, 4.11.01

Dec 21, 2015

Overview

• Reasons for Text Mining

• Special Tasks in Mining Text

• Disambiguating Proper Names

• Application Types

• Customer Intelligence

Reasons for Text Mining

• Corporate Knowledge “Ore”

• Exploiting the Knowledge in Text

• The Value of Mining Text

• Typical Applications

Corporate Knowledge “Ore”

• Email

• Insurance claims

• News articles

• Web pages

• Patent portfolios

• Customer complaint letters

• Contracts

• Transcripts of phone calls with customers

• Technical documents

Exploiting Textual Knowledge

• Knowledge Discovery

• Knowledge Management

Value of Text Mining

• Rapid digestion of large corporate documents, faster than human knowledge brokers

• Objective and customizable analysis

• Automation of routine tasks

Typical Applications

• Summarizing documents

• Monitoring relations among people, places, and organizations

• Organizing documents by content

• Organizing indices for search and retrieval

• Retrieving documents by content

Special Tasks in Mining Text

• Interpreting Natural Language

• Comparison with Data Mining

• Extracting Terminology and Relations

• Classifying Documents

Interpreting Natural Language

• Extracting terminology

• Extracting relations

• Summarizing documents

• Extracting models

Comparison of Procedures

Data Mining

• Identify data sets.

• Select features manually.

• Prepare data.

• Analyze distribution.

Text Mining

• Identify documents.

• Extract features.

• Select features by algorithm.

• Prepare data.

• Analyze distribution.
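
The two steps that text mining adds to the data mining procedure, extracting features and selecting them by algorithm, might look roughly like the sketch below. It assumes a bag-of-words representation and a document-frequency threshold; the slides state neither choice, and the function names and sample documents are hypothetical.

```python
import re
from collections import Counter

def extract_features(text):
    """Feature extraction: count lowercase word tokens in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def select_features(doc_counts, min_docs=2):
    """Algorithmic selection (assumed rule): keep terms found in at least min_docs documents."""
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def to_vector(counts, vocabulary):
    """Data preparation: map one document to a fixed-length count vector."""
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical three-document corpus.
docs = [
    "The claim was denied.",
    "The customer filed a claim.",
    "Patent portfolios grow.",
]
counts = [extract_features(d) for d in docs]
vocabulary = select_features(counts)
vectors = [to_vector(c, vocabulary) for c in counts]
```

On a realistic corpus the vocabulary runs to thousands of terms and most vector entries are zero, which is why manual feature selection is impractical for text.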

Terminology and Relations

• What Terminology Is

• Classes of Terms

• Instances of Relations

• Canonical Forms

What Terminology Is

• Function words

• General-purpose content words and phrases

• Technical content words and phrases

• Relations

Classes of Terminology

• Proper names

• Technical phrases

• Abbreviations and acronyms

Instances of Relations

• Facts

• Dates

• Currency values

• Percentages

• Other measurements

Canonical Forms

• Numbers convert to normal form.

• Dates convert to normal form.

• Inflected forms convert to common form.

• Alternative names convert to explicit form.
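
A minimal sketch of the four conversions above, assuming ISO 8601 as the normal form for dates and small lookup tables for inflections and alternative names. The accepted date formats, the tables, and all function names are hypothetical; the slides do not specify them.

```python
from datetime import datetime

def canonical_number(text):
    """Normalize a number written with thousands separators, e.g. '1,250'."""
    return float(text.replace(",", ""))

def canonical_date(text, formats=("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d")):
    """Try each assumed input format and return an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

# Hypothetical lookup tables; a real system would use a morphological
# analyzer and a much larger name dictionary.
LEMMAS = {"mines": "mine", "mined": "mine", "mining": "mine"}
ALIASES = {"IBM": "International Business Machines"}

def canonical_word(word):
    """Map an inflected form to its common form (lowercased if unknown)."""
    return LEMMAS.get(word.lower(), word.lower())

def canonical_name(name):
    """Map an alternative name to its explicit form (unchanged if unknown)."""
    return ALIASES.get(name, name)
```

Canonical forms matter because feature counting treats every distinct string as a distinct feature; without them, "21.12.2015" and "12/21/2015" would count as unrelated terms.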

Classifying Documents

• Hierarchical clustering

• Binary relational clustering

• Supervised learning

Disambiguating Proper Names

• Principles of Nominator Design

• The Process in Nominator

Principles of Nominator Design

• Apply heuristics to strings, instead of interpreting semantics.

• The unit of context for extraction is a document.

• The unit of context for aggregation is a corpus.

• The heuristics represent English naming conventions.

Extracting Proper Names

• Tokenize the words in a document.

• Build list of candidate names in document.

• Break candidates into smaller names.

• Group names into equivalence classes.

• Aggregate classes from multiple documents.

Candidate Names

• Extract all sequences of capitalized tokens.

• Exclude adjectives of provenance (e.g. Mr., Dr.).

• Exclude certain non-name acronyms (e.g. M.D., Ph.D.).

• Include numerals, unless following a preposition, comma, date, or number.

• Ignore words in section titles.

• Exclude initial adverbs in sentences.
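
The first rule, extracting sequences of capitalized tokens, can be sketched as follows. This is a simplified illustration, not Nominator's implementation: it assumes whitespace tokenization, ends a candidate at any lowercase token or trailing comma, and applies only the acronym exclusion. The function name and the exclusion set are hypothetical.

```python
# Hypothetical exclusion set, built from the two acronyms the slides give as examples.
NON_NAME_ACRONYMS = {"M.D.", "Ph.D."}

def candidate_names(sentence):
    """Return maximal runs of capitalized tokens in one sentence."""
    candidates, current = [], []
    # Drop sentence-final punctuation so it does not stick to the last token.
    for raw in sentence.rstrip(".!?").split():
        token = raw.strip(",;:()\"")
        is_cap = token[:1].isupper() and token not in NON_NAME_ACRONYMS
        if is_cap:
            current.append(token)
        # A non-capitalized token or trailing punctuation ends the current run.
        if current and (not is_cap or raw[-1] in ",;:"):
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates
```

Sentence-initial words such as "The" still come out as candidates here; in this sketch they are left for a later filtering step rather than excluded up front.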

Splitting Candidates

• Apply heuristics to conjunctions, prepositions, and possessives.

• Reconstruct shared words.
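
As a rough illustration of both bullets, the sketch below splits a candidate at the conjunction "and" and reconstructs a shared final word. The rule used (a one-word left part borrows the last word of the right part) and the keep-whole list are assumptions made for this example; the slides do not describe the actual heuristics, and prepositions and possessives are not handled.

```python
# Hypothetical list of names in which "and" belongs to the name itself.
KEEP_WHOLE = {"Procter and Gamble"}

def split_candidate(candidate):
    """Split a candidate at 'and', reconstructing a shared final word."""
    if candidate in KEEP_WHOLE or " and " not in candidate:
        return [candidate]
    left, right = candidate.split(" and ", 1)
    right_words = right.split()
    # Assumed rule: "Robert and Mary Smith" shares "Smith" with "Robert".
    if len(left.split()) == 1 and len(right_words) > 1:
        left = f"{left} {right_words[-1]}"
    return [left, right]
```

A rule this crude misfires on inputs such as "IBM and Roland Seiffert", which is one reason a practical system needs more evidence before splitting.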

Building Equivalence Classes

• Discard non-recurring initial words of sentences.

• Unify variants with heuristics.

• Pick canonical name for each class.

• Categorize each class with heuristics.

• Map canonical name to variants.

• Map variants to canonical name.
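
A minimal sketch of unification and the two mappings, assuming one simple heuristic: a name joins a class when all of its words occur in that class's longest variant, and the longest variant serves as the canonical name. Both rules are assumptions for illustration, the function name is hypothetical, and categorization is omitted.

```python
def build_classes(names):
    """Group names whose words nest, then build both lookup directions."""
    classes = []  # each class is a list of variants, canonical name first
    # Longest names first, so every class starts with its canonical name.
    for name in sorted(set(names), key=lambda n: (-len(n), n)):
        words = set(name.split())
        for cls in classes:
            if words <= set(cls[0].split()):
                cls.append(name)
                break
        else:
            classes.append([name])
    canonical_to_variants = {cls[0]: cls for cls in classes}
    variant_to_canonical = {v: cls[0] for cls in classes for v in cls}
    return canonical_to_variants, variant_to_canonical
```

Keeping both maps lets an application move from any mention in the text to the canonical name, and from a canonical name back to every spelling found in the document.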

Aggregating Classes

• Merge classes that share a variant in separate documents.

• Both type and spelling of variant must agree.

• Replace uncertain categories with certain ones.
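
The three rules above can be sketched as a single merging pass over per-document classes. The sketch assumes that an uncertain category is compatible with any certain one, which the slides do not spell out; the data layout, the UNCERTAIN marker, and the function names are hypothetical.

```python
UNCERTAIN = "UNCERTAIN"  # hypothetical marker for a class without a confident category

def types_agree(a, b):
    """Assumed rule: categories agree if equal or if either is uncertain."""
    return a == b or UNCERTAIN in (a, b)

def aggregate(per_document_classes):
    """Merge classes from several documents that share a variant and agree in type."""
    merged = []
    for classes in per_document_classes:
        for cls in classes:
            for target in merged:
                shares_variant = bool(target["variants"] & cls["variants"])
                if shares_variant and types_agree(target["type"], cls["type"]):
                    target["variants"] |= cls["variants"]
                    # Replace an uncertain category with a certain one.
                    if target["type"] == UNCERTAIN:
                        target["type"] = cls["type"]
                    break
            else:
                merged.append({"variants": set(cls["variants"]), "type": cls["type"]})
    return merged
```

Requiring the types to agree keeps, for example, a person named "Ford" in one document apart from the organization "Ford" in another, even though the spelling is identical.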

Application Types

• Knowledge Discovery (Clustering)

• Information Distillation (Categorization)

Knowledge Discovery

Information Distillation

Customer Intelligence

• Goals

• Process

Customer Intelligence Goals

• What do customers want and need?

• What do customers think of the company?

Customer Intelligence Process

• Collect a corpus of communications with customers.

• Cluster the documents to identify issues.

• Characterize the clusters to identify the conditions for problems.

• Assign new messages to appropriate clusters.
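
The process above can be sketched end to end with a deliberately simple technique: greedy single-pass clustering on Jaccard similarity of word sets. This is a stand-in chosen for brevity, not the method used in Intelligent Miner for Text; the stopword list, threshold, and function names are all hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "my", "and", "to", "was", "i", "it"}  # hypothetical

def words(text):
    """Lowercase word set of a message, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(messages, threshold=0.2):
    """Put each message in the first sufficiently similar cluster, else start a new one."""
    clusters = []
    for msg in messages:
        w = words(msg)
        for c in clusters:
            if jaccard(w, c["vocab"]) >= threshold:
                c["members"].append(msg)
                c["vocab"] |= w
                break
        else:
            clusters.append({"members": [msg], "vocab": set(w)})
    return clusters

def characterize(c, top=3):
    """Describe a cluster by its most frequent words."""
    counts = Counter(w for m in c["members"] for w in words(m))
    return [w for w, _ in counts.most_common(top)]

def assign(message, clusters):
    """Return the index of the cluster most similar to a new message."""
    w = words(message)
    return max(range(len(clusters)), key=lambda i: jaccard(w, clusters[i]["vocab"]))
```

Clustering and characterization correspond to knowledge discovery, while assigning new messages to existing clusters corresponds to information distillation.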

Summary

• Reasons for Text Mining

• Special Tasks in Mining Text

• Disambiguating Proper Names

• Customer Intelligence

Exam Question #1

• Name an example of each of the two main classes of applications of text mining.

– Knowledge Discovery: Discovering a common customer complaint among much feedback.

– Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2

• How does the procedure for text mining differ from the procedure for data mining?

– Adds feature extraction function

– Not feasible to have humans select features

– Highly dimensional, sparsely populated feature vectors

Exam Question #3

• In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?

– Does not perform in-depth syntactic or semantic analyses of texts

Questions & Answers