Jan 03, 2016

Page 1: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Dr. C. Lee Giles

David Reese Professor, College of Information Sciences and Technology

The Pennsylvania State University, University Park, PA, USA

[email protected]

http://clgiles.ist.psu.edu

IST 511 Information Management: Information and Technology

Information extraction, data mining, metadata

Special thanks to E. Agichtein, K. Borne, S. Sarawagi, C. Lagoze,

Page 2: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Last time

What are probabilities
What is information theory
What is probabilistic reasoning
– Definitions
– Why important
– How used – decision making
– Decision trees

Impact on information science

Page 3: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Today

What is information extraction
What is data mining
– Text mining as subfield
What is metadata
Impact on information science

Page 4: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Tomorrow

Topics used in IST
• Digital libraries
• Scientometrics, bibliometrics
• Digital humanities

Page 5: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Theories in Information Sciences

Enumerate some of these theories in this course.

Issues:
– Unified theory?
– Domain of applicability
– Conflicts

Theories here are
– Very algorithmic
– Some quantitative
– Some qualitative

Quality of theories
– Occam’s razor
– Subsumption of other theories (all can use machine learning)
– Text mining special case of data mining
– Natural language processing uses data mining methods

Theories
– Natural language processing

Page 6: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Science Paradigms

Thousand years ago: science was empirical
– describing natural phenomena

Last few hundred years: theoretical branch
– using models, generalizations

Last few decades: a computational branch
– simulating complex phenomena

Today: data science (eScience)
– unify theory, experiment, and simulation
– Data captured by instruments or generated by simulator
– Processed by software
– Information/Knowledge stored in computer
– Scientist analyzes database / files using data management and statistics

$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$

Page 7: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Information extraction, data mining and natural language processing

• Natural language processing is the processing and understanding of human language by machines
• Information Extraction can be considered a subclass
• Also known as knowledge extraction

• Data mining is the process of discovering new patterns from large data sets

• Text mining is the data mining of text
• Text analytics generally refers to the tools used

• Information extraction is the process of extracting and labeling relevant data from large data sets, usually text

• Large means manually unreasonable

Page 8: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

The Value of Unstructured Text Data

“Unstructured” text data is the primary form of human-generated information
– Business and government reports, blogs, web pages, news, scientific literature, online reviews, …

Need to extract information and give it structure to effectively manage, search, mine, store and utilize this data

Information Extraction: a maturing and active research area
– Software and companies exist
– Intersection of Computational Linguistics, Machine Learning, Data mining, Databases, and Information Retrieval

Active crawling for text data

Page 9: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Example: Answering Queries Over Text

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE

Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..

SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'

Bill Gates

Bill Veghte

(from William Cohen’s IE tutorial, 2003)
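Once the entities are in a table, the query on this slide can be run as ordinary SQL. The following is a minimal sketch, assuming an in-memory SQLite database; the table layout follows the PEOPLE table, the organization for Richard Stallman is spelled out from the passage text, and the helper name `names_at` is hypothetical.

```python
import sqlite3

# Hypothetical in-memory table holding the tuples extracted from the news text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany(
    "INSERT INTO PEOPLE VALUES (?, ?, ?)",
    [
        ("Bill Gates", "CEO", "Microsoft"),
        ("Bill Veghte", "VP", "Microsoft"),
        ("Richard Stallman", "Founder", "Free Software Foundation"),
    ],
)

def names_at(organization):
    """Return the names of all people extracted for one organization."""
    rows = conn.execute(
        "SELECT Name FROM PEOPLE WHERE Organization = ? ORDER BY Name",
        (organization,),
    )
    return [name for (name,) in rows]

print(names_at("Microsoft"))  # ['Bill Gates', 'Bill Veghte']
```

The point of the sketch is only that extraction turns free text into rows a standard query engine can answer questions over.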

Page 10: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Information extraction from text or pdfs

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE

Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..

SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'

(William Cohen’s IE tutorial, 2003)

For extraction of OAI metadata from academic documents, see CiteSeerX

citeseerx.ist.psu.edu


XML or database

Page 11: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Information Extraction Tasks

Extracting entities and relations: this talk

– Entities: named (e.g., Person) and generic (e.g., disease name)

– Relations: entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company)

– Events: can be composed from multiple relation tuples

Common extraction subtasks:

– Preprocess: sentence chunking, syntactic parsing, morphological analysis

– Create rules or extraction patterns: hand-coded, machine learning, and hybrid

– Apply extraction patterns or rules to extract new information

– Postprocess and integrate information
• Co-reference resolution, deduplication, disambiguation

Page 12: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Entities

Wikipedia: An entity is something that has a distinct, separate existence, although it need not be a material existence.

Features:
– Permanent vs transient
– Unique vs common
– Animate vs inanimate
– Small vs large
– Mobile vs sessile
– Place vs thing
– Abstract vs real
– Bio labels

Digital mention or reference

Page 13: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Example: Extracting Entities from Text

– Useful for data warehousing, data cleaning, web data integration

Address

14089 Whispering Pines Nobel Drive San Diego CA 92122

House number | Building | Road | City | State | Zip
14089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122

Citation

Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002

Segment (si) | Sequence | Label(si)
S1 | Ronald Fagin | Author
S2 | Combining Fuzzy Information from Multiple Systems | Title
S3 | Proc. of ACM SIGMOD | Conference
S4 | 2002 | Year
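The citation example can be reproduced with a few hand-written rules. This is only a sketch under strong assumptions: segments are separated by commas, a conference segment starts with "Proc.", and the first segment is the author. The function name `label_citation` and the rules are illustrative, not the method used in the slides, and a title containing a comma would break them.

```python
import re

def label_citation(citation):
    """Split a comma-separated citation and label each segment with simple rules."""
    segments = [s.strip() for s in citation.split(",") if s.strip()]
    labeled = []
    for i, seg in enumerate(segments):
        if re.fullmatch(r"(19|20)\d{2}", seg):
            label = "Year"          # a bare four-digit year
        elif seg.startswith("Proc."):
            label = "Conference"    # assumed marker for proceedings
        elif i == 0:
            label = "Author"        # assumed: author comes first
        else:
            label = "Title"
        labeled.append((seg, label))
    return labeled

citation = ("Ronald Fagin, Combining Fuzzy Information from Multiple Systems, "
            "Proc. of ACM SIGMOD, 2002")
for segment, label in label_citation(citation):
    print(f"{label:<10} {segment}")
```

Rules like these are quick to write for one citation style, which is exactly why learned sequence labelers are attractive when styles vary.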

Page 14: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Entity Disambiguation

Task of clustering and linking similar entities in a document or between documents.
– Labels, sometimes complex, are given to these entities

Sometimes includes the task of extracting or finding those entities (information extraction, focused crawling, etc.)

Page 15: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Hand-Coded Methods

Easy to construct in some cases
– e.g., to recognize prices, phone numbers, zip codes, conference names, etc.

Intuitive to debug and maintain
– Especially if written in a “high-level” language:
ContactPattern RegularExpression(Email.body, "can be reached at") [IBM Avatar]
– Can incorporate domain knowledge

Scalability issues:
– Labor-intensive to create
– Highly domain-specific
– Often corpus-specific
– Rule-matches can be expensive
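A hand-coded extractor for the entity types named on this slide can be as small as a handful of regular expressions. The patterns below are illustrative assumptions (US-style prices, phone numbers, and zip codes), not rules from any particular system, and the sample sentence is invented.

```python
import re

# Illustrative hand-written patterns; real data would need broader rules.
PATTERNS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "phone": re.compile(r"\(?\d{3}\)?[ -]\d{3}-\d{4}"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract(text):
    """Return every match of each hand-coded pattern, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

sample = "Tickets cost $25.00; call (814) 555-0100 or write to University Park, PA 16802."
print(extract(sample))
```

The zip pattern would also fire on a five-digit house number such as 14089, which illustrates why such rules tend to be domain- and corpus-specific.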

Page 16: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Entity Disambiguation by some other name?

record linkage

merge/purge processing or list washing

data matching

object identity problem

named entity resolution

duplicate detection

record matching

instance identification

deduplication

coreference resolution

reference reconciliation

database hardening

Closely related to Natural Language Processing

Page 17: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Entity Disambiguation Applications

Speech understanding

Question/answering

Health records

Criminal activities

Finance records

Semantic web applications

Scientific discovery and search

Semantic search

Others?

Page 18: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Entity Tagging

Identifying mentions of entities (e.g., person names, locations, companies) in text
– MUC (1997): Person, Location, Organization, Date/Time/Currency

– ACE (2005): more than 100 more specific types

Hand-coded vs. Machine Learning approaches

Best approach depends on entity type and domain:
– Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries

– Syntactic (e.g., phone numbers, zip codes): regular expressions

– Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc.

– “Almost solved” for common/typical entity types

Page 19: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Machine Learning Methods

Can work well when lots of training data is available and easy to construct

Can capture complex patterns that are hard to encode with hand-crafted rules
– e.g., determine whether a review is positive or negative
– extract long complex gene names
– Non-local dependencies

“The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.” [From AliBaba]

Page 20: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Representation Models [Cohen and McCallum, 2003]

Any of these models can be used to capture words, formatting or both.

[Figure: representation models, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
– Lexicons: is the phrase a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
– Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
– Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
– Boundary Models: BEGIN and END positions are marked
– Finite State Machines: most likely state sequence?
– Context Free Grammars: most likely parse? (tags shown: NNP, V, P, NP, PP, VP, S)
– …and beyond
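Two of the simplest models, the lexicon and the sliding window, can be combined in a few lines. This is a minimal sketch, assuming whitespace tokenization and a tiny hand-picked set of state names; the function names and the lexicon contents are hypothetical, and a real system would replace the membership test with a trained classifier.

```python
# Tiny illustrative lexicon of US state names (an assumption, not the full list).
STATE_LEXICON = {"Alabama", "Alaska", "Kentucky", "New York", "Wisconsin", "Wyoming"}

def sliding_windows(tokens, max_size):
    """Yield (start, end, phrase) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield start, start + size, " ".join(tokens[start:start + size])

def find_lexicon_mentions(sentence, lexicon, max_size=2):
    """Return the windows whose text is a member of the lexicon."""
    tokens = sentence.rstrip(".").split()
    return [(start, end, phrase)
            for start, end, phrase in sliding_windows(tokens, max_size)
            if phrase in lexicon]

print(find_lexicon_mentions("Abraham Lincoln was born in Kentucky.", STATE_LEXICON))
# [(5, 6, 'Kentucky')]
```

Trying window sizes 1 and 2 is what lets the same loop find multi-word names such as "New York".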

Page 21: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

(Person) Name Disambiguation

Person Name disambiguation
– A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
– Large # of mentions and entities

Consider three types of person name ambiguities:
– Aliases - one person with multiple aliases, name variations, or name changes, e.g. CL Giles & Lee Giles, Superman & Clark Kent

– Common Names - more than one person shares a common name, e.g. Jian Huang – 118 papers in DBLP

– Typography Errors - resulting from human input or automatic extraction

Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline

Page 22: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Popular Machine Learning Methods

Naive Bayes

SRV [Freitag 1998], Inductive Logic Programming

Rapier [Califf and Mooney 1997]

Hidden Markov Models [Leek 1997]

Maximum Entropy Markov Models [McCallum et al. 2000]

Conditional Random Fields [Lafferty et al. 2001]

Scalability
– Can be labor intensive to construct training data

– At run time, complex features can be expensive to construct or process (batch algorithms can help: [Chandel et al. 2006] )

For details: [Feldman, 2006 and Cohen, 2004]

Page 23: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data mining?

Process of semi-automatically analyzing large data sets and databases to find patterns that are:
– valid: hold on new data with some certainty

– novel: non-obvious to the system

– useful: should be possible to act on the item

– understandable: humans should be able to interpret the pattern

Page 24: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Evolutionary Step | Business Question | Enabling Technologies | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery

Evolution of Data Mining <http://www.thearling.com/text/dmwhite/dmwhite.htm>

Page 25: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining is Ready for Prime Time

• Data mining is ready for general application because it engages three technologies that are now sufficiently mature:

Massive data collection & delivery

Powerful multiprocessor computers

Sophisticated data mining algorithms

Page 26: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Organizational Reasons to use Data Mining

– Most organizations already collect and refine massive quantities of data.

– Their most important information is in their data warehouses.

– Data mining moves beyond the analysis of past events … to predicting future trends and behaviors that may be missed because they lie outside the experts’ expectations.

– Data mining tools can answer complex business questions that traditionally were too time-consuming to resolve.

– Data mining tools can explore the intricate interdependencies within databases in order to discover hidden patterns and relationships.

– Data mining allows decision-makers to make proactive, knowledge-driven decisions.

Page 27: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology
Page 28: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

A Key Concept for Data Mining

• Data Mining delivers actionable data:
– data that support decision-making

– data that lead to knowledge and understanding

– data with a purpose

• i.e., Data do not exist for their own sake.

• The Data Warehouse is a corporate asset (whether in business, marketing, banking, science, telecommunications, entertainment, computer security, or security).

Page 29: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining - the up side

Data mining is everywhere:
– Huge scientific databases (NASA, Human Genome,…)
– Corporate databases (OLAP)
– Credit card usage histories (Capital One)
– Loan applications (Credit Scoring)
– Customer purchase records (CRM)
– Web traffic analysis (Doubleclick)
– Network security intrusion detection (Silent Runner)
– The hunt for terrorists
– The NBA!

Page 30: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining - the down side

• Data mining is a pejorative in the business database community (“data dredging”)
– They prefer to call it Knowledge Discovery, or Business Intelligence, or CRM (Customer Relationship Management), or Marketing, or OLAP (On-Line Analytical Processing)
• Legal issues in many countries
• The Data Mining Moratorium Act of 2003
– debated within the U.S. Congress
– privacy concerns
– directed primarily against the DARPA TIA Program (Total Information Awareness)

Page 31: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Characteristics of The Information Age:

• Data “Avalanche”
– the flood of Terabytes of data is already happening, whether we like it or not
– our present techniques of handling these data do not scale well with data volume
• Distributed Digital Archives
– will be the main access to data
– will need to handle hundreds to thousands of queries per day
• Systematic Data Exploration and Data Mining
– will have a central role

• statistical analysis of “typical” events

• automated search for “rare” events

Page 32: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

The Data Flood is Everywhere

Huge quantities of data are being generated in all business, government, and research domains:

– Banking, retail, marketing, telecommunications, other business transactions ...

– Scientific data: genomics, astronomy, biology, etc.

– Web, text, and e-commerce

Page 33: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Growth Rate

[Figure: data growth in exabytes, 10-fold growth in 5 years. Sources shown: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances. Source: IDC, 2008]

Page 34: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What is Data Mining?

Data mining is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases.”

Data mining is used to find patterns and relationships in data. (EDA = Exploratory Data Analysis)

Patterns can be analyzed via 2 types of models:
– Descriptive: Describe patterns and create meaningful subgroups or clusters.
– Predictive: Forecast explicit values, based upon patterns in known results.

How does this become useful (not just bits of data)? ...
– … through KNOWLEDGE DISCOVERY

Data → Information → Knowledge → Understanding / Wisdom!

Page 35: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Historical Note: Many Names of Data Mining

Data Fishing, Data Dredging: 1960-
– used by Statisticians (as a bad name)

Data Mining: 1990-
– used by DB & business communities
– in 2003 – bad image because of DARPA TIA

Knowledge Discovery in Databases (1989-)
– used by AI & Machine Learning communities

also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery seem to be used interchangeably.

Page 36: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Relationship with other fields

Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on
– scalability of number of features and instances

– algorithms and architectures, whereas foundations of methods and formulations are provided by statistics and machine learning

– automation for handling large, heterogeneous data

Page 37: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Some basic operations

Predictive:
– Regression

– Classification

– Collaborative Filtering

Descriptive:
– Clustering / similarity matching

– Association rules and variants

– Deviation detection

Page 38: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Examples

• Classic Textbook Example of Data Mining (Legend?):

Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.

• Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers.

• A financial institution discovered that credit applicants who used pencil on the form were much more likely to default on their debts than those who filled out the application using ink.

• Credit card companies recommend products to cardholders based on analysis of their monthly expenditures.

• Airline purchase transaction logs revealed that 9-11 hijackers bought one-way airline tickets with the same credit card.

• Astronomers examined objects with extreme colors in a huge database to discover the most distant Quasars ever seen.

Page 39: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology
Page 40: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Application: Marketing

Sales Analysis

• associations between product sales:
• beer and diapers
• strawberry pop tarts and beer (and hurricanes)

Customer Profiling

• data mining can tell you what types of customers buy what products

Identifying Customer Requirements

• identify the best products for different customers

• use prediction to find what factors will attract new customers

Page 41: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Application: Fraud Detection

Auto Insurance Fraud
• Association Rule Mining can detect a group of people who stage accidents to collect on insurance

Money Laundering
• Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions

Banking: Loan Fraud
• Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud

Page 42: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

The Necessity of Data Mining

• Enormous interest in these data collections.
• The environment to exploit these data does not exist!
– 1 Terabyte at 100 Mbits/sec takes 1 day to transfer.
– Hundreds to thousands of queries per day.
– Data will reside at multiple locations, in many different formats.
– Existing analysis tools do not scale to Terabyte data collections.

• The need is acute! A solution will not just happen.
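The "1 day to transfer" figure can be checked with simple arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and a link that sustains exactly 100 Mbit/s with no protocol overhead; the function name is hypothetical.

```python
def transfer_time_hours(size_bytes, rate_bits_per_sec):
    """Time to move size_bytes over a link of the given bit rate, in hours."""
    return size_bytes * 8 / rate_bits_per_sec / 3600

TERABYTE = 10**12      # decimal terabyte (assumption)
RATE = 100 * 10**6     # 100 Mbit/s, assumed fully sustained

hours = transfer_time_hours(TERABYTE, RATE)
print(f"{hours:.1f} hours")  # 22.2 hours
```

At roughly 22 hours under ideal conditions, "1 day" is a fair rounding, and real links with overhead would take longer.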

Page 43: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What is Knowledge Discovery?

• Knowledge discovery refers to “finding out new knowledge about an application domain using data on the domain usually stored in a database.”
– Application domains: scientific, customer purchase records, computer network logs, web traffic logs, financial transactions, census data, basketball play-by-play histories, ...

• Why are Data Mining & Knowledge Discovery such hot topics? Because of the enormous interest in these huge databases and their potential for new discoveries.

• In large databases, Data Mining and Knowledge Discovery come in two flavors:
– Event-based mining
– Relationship-based mining

Page 44: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Event-Based Mining
(Event-based mining is based upon events or trends in data.)

Four distinct orthogonal categorizations:

• Known events / known models - use existing models (descriptive models) to locate known phenomena of interest either spatially or temporally within a large database.

• Known events / unknown models - use clustering properties of data to discover new relationships and patterns among known phenomena.

• Unknown events / known models - use known associations and relationships (predictive models) among parameters that describe a phenomenon to predict the presence of previously unseen examples of the same phenomenon within a large complex database.

• Unknown events / unknown models - use thresholds or trends to identify transient or otherwise unique ("one-of-a-kind") events and therefore to discover new phenomena. Serendipity!

Page 45: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Relationship-Based Data Mining
(Based upon associations & relationships among data items)

• Spatial associations -- identify events or objects at the same physical spatial location, or at related locations (e.g., urban versus rural data).

• Temporal associations -- identify events or transactions occurring during the same or related periods of time (e.g., periodically, or N days after event X).

• Coincidence associations -- use clustering techniques to identify events that are co-located (that coincide) within a multi-dimensional parameter space.

Page 46: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

User Requirements for a Data Mining System
(What features must a DM system have for users?)

• Cross-Identification - refers to the classical problem of associating the objects listed in one database to the objects listed in another.

• Cross-Correlation - refers to the search for correlations, tendencies, and trends between parameters in multi-dimensional data, usually across databases.

• Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a single database.

• Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to one or more databases in the hope of making a serendipitous discovery of new events/objects or a new class of events/objects.

Page 47: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Representative Data Mining Architecture

<http://www.thearling.com/text/dmwhite/dmwhite.htm>

Page 48: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data leads to Knowledge leads to Understanding

EXAMPLE :

Data = 00100100111010100111100 (stored in database)

Information = ages and heights of children (metadata)

Knowledge = the older children tend to be taller

Understanding = children’s bones grow as they get older

Data → Information → Knowledge → Understanding / Wisdom!

Page 49: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Astronomy Example

Data:
(a) Imaging data (ones & zeroes)
(b) Spectral data (ones & zeroes)

Information (catalogs / databases):
– Measure brightness of galaxies from image (e.g., 14.2 or 21.7)
– Measure redshift of galaxies from spectrum (e.g., 0.0167 or 0.346)

Knowledge: Hubble Diagram
– Redshift-Brightness Correlation
– Redshift = Distance

Understanding: the Universe is expanding!!

Page 50: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Goal of Data Mining

The end goal of data mining is not the data themselves, but the new knowledge and understanding that are revealed in the process = Business Intelligence (BI). (Remember what we said about the business community’s opinion of D.M.)

This is why the research field is usually referred to as KDD = Knowledge Discovery in Databases.

Page 51: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

The Data Mining Process

The most important and time-consuming step is Cleaning the Data.

Page 52: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Methods and Some Examples

Methods:
– Clustering
– Classification
– Associations
– Neural Nets
– Decision Trees
– Pattern Recognition
– Correlation/Trend Analysis
– Principal Component Analysis
– Regression Analysis
– Outlier/Glitch Identification
– Visualization
– Autonomous Agents
– Self-Organizing Maps (SOM)
– Link (Affinity) Analysis

Examples:
– Classify new data items using the known classes & groups
– Find associations and patterns among different data items
– Organize information in the database based on relationships among key data descriptors
– Identify linkages between data items based on features shared in common
– Find all groups and classes of objects represented in the data

Page 53: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Some Data Mining Techniques Graphically Represented

Self-Organizing Map (SOM)

Outlier (Anomaly) Detection

Clustering

Link Analysis Decision Tree

Neural Network

Page 54: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Technique: Clustering

In this case, three different groups (classes) of items were found among all of the items in the data set.

Page 55: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Technique: Decision Tree Classification

Question:

Should I play tennis today?

(I must really love tennis!)

Similar to game “20 questions”

Same technique used by bank loan officers to identify good potential customers versus poor customers.
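A decision tree is just a nested sequence of questions, much like "20 questions". The slide's own figure is not preserved in this transcript, so the sketch below assumes the widely used textbook play-tennis tree (outlook, then humidity or wind); the attribute names and values are assumptions for illustration.

```python
def play_tennis(outlook, humidity, wind):
    """Walk a small hand-written decision tree and return 'yes' or 'no'."""
    if outlook == "sunny":
        # On sunny days the decision depends on humidity.
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        # On rainy days the decision depends on wind.
        return "no" if wind == "strong" else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("sunny", "high", "weak"))    # no
print(play_tennis("rain", "normal", "weak"))   # yes
```

In data mining the tree is learned from labeled records rather than written by hand; the code only shows how a finished tree is applied to a new case.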

Page 56: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Technique: Association Rule Mining (Market Basket Analysis)

Sales records:

transaction id | customer id | products bought
tran1 | cust33 | p2, p5, p8
tran2 | cust45 | p5, p8, p11
tran3 | cust12 | p1, p9
tran4 | cust40 | p5, p8, p11
tran5 | cust12 | p2, p9
tran6 | cust12 | p9

• Trend (Rule): Products p5, p8 often bought together
• Trend (Rule): Customer 12 likes product p9
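The first trend can be quantified with the usual support and confidence measures. The sketch below uses the six sales records from this slide; the function names are hypothetical, and counting all pairs by brute force is an assumption that only suits tiny data (algorithms such as Apriori prune the search on real data).

```python
from itertools import combinations
from collections import Counter

# Sales records from the slide: (transaction id, customer id, products bought).
SALES = [
    ("tran1", "cust33", {"p2", "p5", "p8"}),
    ("tran2", "cust45", {"p5", "p8", "p11"}),
    ("tran3", "cust12", {"p1", "p9"}),
    ("tran4", "cust40", {"p5", "p8", "p11"}),
    ("tran5", "cust12", {"p2", "p9"}),
    ("tran6", "cust12", {"p9"}),
]

def support(itemset):
    """Fraction of transactions that contain every product in itemset."""
    hits = sum(1 for _, _, products in SALES if itemset <= products)
    return hits / len(SALES)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Count how often each pair of products appears in the same transaction.
pair_counts = Counter(
    pair
    for _, _, products in SALES
    for pair in combinations(sorted(products), 2)
)

print(pair_counts.most_common(1))    # [(('p5', 'p8'), 3)]
print(support({"p5", "p8"}))         # 0.5
print(confidence({"p5"}, {"p8"}))    # 1.0
```

Products p5 and p8 appear together in half of all transactions, and every transaction containing p5 also contains p8, which is what the rule "p5, p8 often bought together" summarizes.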

Page 57: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Algorithm: The SOM

Figure: The SOM (Self-Organizing Map) is one technique for organizing information in a database based upon links between concepts.

It can be used to find hidden relationships and patterns in more complex data collections, usually based on links between keywords or metadata.

Page 58: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Application: Outlier Detection

Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#).

Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).

Page 59: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Link Analysis for Terrorist SNA

Find all connections and relationships among known terrorists.

Page 60: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Technology: Parallel Mining

Figure: Parallel Data Mining

The application of parallel computing resources and parallel data access (e.g., RAID) enables concurrent drill-downs into large data collections

Page 61: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Methods Explained

• Clustering: Group data items according to tight relationships.

• Classification: Assign data items to predetermined groups.

• Associations: Associate data with similar relationships. The beer-diaper example is an example of associative mining.

• Artificial Neural Networks (ANN): Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Decision Trees: Hierarchical sets of decisions, based upon rules, for rapid classification of a data collection.

• Sequential Patterns: Identify or predict behavior patterns or trends.

• Genetic Algorithms: Rapid optimization techniques that are based on the concepts of natural evolution.

• Nearest Neighbor Method: Classify a data item according to its nearest neighbors (records that are most similar).

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

• Data visualization: The illustration and visual interpretation of complex relationships in multidimensional data using graphics tools.

• Self-Organizing Map (SOM): Graphically organizes (in a 2-dimensional map) the information stored within a database based upon similarities and links between concepts. It can be used to find hidden relationships and patterns in more complex data collections.
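
To make one of these methods concrete, here is a minimal sketch of the nearest neighbor method: a new item takes the majority label of its k most similar labeled records. Euclidean distance, k = 3, and the sample records are assumptions chosen for the illustration.

```python
import math
from collections import Counter


def knn_classify(labeled, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled records."""
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled records: (feature vector, class label)
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

if __name__ == "__main__":
    print(knn_classify(training, (1.1, 1.0)))
    print(knn_classify(training, (5.1, 5.0)))
```

Unlike clustering, this is classification: the groups ("A" and "B") are predetermined and the labeled records are needed up front.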

Page 62: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Data Mining Techniques: techniques are based on Algorithms; techniques are used in Applications

Page 63: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

KDnuggets

Page 66: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

http://www.kdnuggets.com/polls/2004/data_mining_applications_industries.htm

Industry/field (votes)              Share
Banking (29)                        13%
Scientific data (20)                9%
Direct Marketing/Fundraising (19)   9%
Fraud Detection (19)                9%
Bioinformatics/Biotech (18)         8%
Insurance (15)                      7%
Medical/Pharma (15)                 7%
Telecommunications (12)             6%
eCommerce/Web (12)                  6%
Investment/Stocks (9)               4%
Manufacturing (9)                   4%
Retail (9)                          4%
Security (8)                        4%
Travel (2)                          1%
Entertainment/News (1)              0.5%
Other (19)                          9%

Poll of Users: Where do you currently apply data mining? (August 2004)

“Industries/fields where you currently apply data mining?” [216 votes total]

Page 67: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

The importance of metadata and their rules

• So we have all this mined or extracted data: what is it?

• Label some of it and call it metadata

• You know what it is

• Make it available to others (if you can)

Tim Berners-Lee
– Inventor of the World Wide Web
– Founder of the W3C

Presentation at TED

Page 68: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

“Metadata is data about data”

Metadata (and Markup languages)

Metadata often is written in XML

Page 69: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Metadata is semi-structured data conforming to commonly agreed upon models, providing operational interoperability in a heterogeneous environment

Page 70: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What is metadata? Some simple definitions

‘Structured data about data’.

• Dublin Core Metadata Initiative FAQ, 2005

– http://dublincore.org/resources/faq/

Machine-understandable information about Web resources or other things.

• Tim Berners-Lee, W3C, 1997

– http://www.w3.org/DesignIssues/Metadata

Page 71: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

"Web resources or other things"

– HTML documents
– digital images
– databases
– books
– museum objects
– archival records
– metadata records
– Web sites
– collections
– services
– physical places
– people
– organizations
– “works”
– formats
– concepts
– events

• Metadata might be "about"… anything!

Page 72: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What is metadata? Towards a "functional" view

Data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics.

• Lorcan Dempsey & Rachel Heery, "Metadata: a current view of practice and issues", 1998

– http://www.ukoln.ac.uk/metadata/publications/jdmetadata/

Structured data about resources that can be used to help support a wide range of operations.

• Michael Day, "Metadata in a Nutshell", 2001

– http://www.ukoln.ac.uk/metadata/publications/nutshell/

Page 73: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What might metadata "say"?

What is this called?

What is this about?

Who made this?

When was this made?

Where do I get (a copy of) this?

When does this expire?

What format does this use?

Who is this intended for?

What does this cost?

Can I copy this? Can I modify this?

What are the component parts of this?

What else refers to this?

What did "users" think of this?

(etc!)
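
Several of these questions map onto elements of the Dublin Core element set (title, subject, creator, format, rights). The sketch below builds a small XML record with Python's standard library to show what such answers can look like in markup. The wrapper element name `record` and the field values are assumptions for illustration, not an official record for this lecture.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Serialize a dict of Dublin Core element names and values as XML text."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Illustrative values only
xml_text = build_record({
    "title": "Information extraction, data mining, metadata",  # What is this called?
    "subject": "metadata",                                     # What is this about?
    "creator": "C. Lee Giles",                                 # Who made this?
    "format": "application/vnd.ms-powerpoint",                 # What format does this use?
    "rights": "Contact the author for reuse terms",            # Can I copy this?
})

if __name__ == "__main__":
    print(xml_text)
```

Because the element names come from a commonly agreed vocabulary, another system can read the record without knowing anything else about the resource.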

Page 74: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What operations/functions?

resource disclosure & discovery
resource retrieval, use
resource management, including preservation
verification of authenticity
intellectual property rights management
commerce
content-rating
authentication and authorization
personalization and localization of services
(etc!)

Page 75: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What operations/functions?

Different functions: different metadata

Metadata (and metadata standards) sometimes classified according to function
– Descriptive: primarily for discovery, retrieval
– Administrative: primarily for management
– Structural: relationships between component parts of resources
– Contextual: relationships between resources

No “one size fits all” solution!

Page 76: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Metadata importance

“data about data” is about as good as the definition gets...

As a data resource grows, metadata becomes more important

Lack of metadata has different consequences
– documentation: metadata can be regenerated automatically, or by hand
– datasets, pictures: once lost, can be impossible to regenerate

Page 77: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Types of Metadata

Descriptive
– Discovery / description of objects
• Title, author, abstract, etc.

Structural
– Storage & presentation of objects
• 1 pdf file, 1 ppt file, 1 LaTeX file, etc.

Administrative
– Managing and preservation of objects
• Access control lists, terms and conditions, format descriptions, “meta-metadata”

See http://www.loc.gov/standards/metadata.html

Page 78: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Which View is Correct?

figure 1 from: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html

Page 79: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Approaches to Metadata

from Ng, Park and Burnett, 1997 (also JASIS, 50(13))

http://www.scils.rutgers.edu/~sypark/asis.html

– library science: bibliographic control
• “organizing the physical containers of information, by means of bibliographical description, subject analysis, and classification notation construction, so that the container can be efficiently described, identified, located and retrieved”

– computer and information science: data management
• “not only to store, access and utilize data effectively, but also to provide data security, data sharing, and data integrity”

Page 80: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

DL Metadata Issues

Who provides metadata?
– author? “publisher”? professional cataloger? extracted from content?

Is metadata “integrated” with data?
– related question: is metadata a first class object?

Formats!
– which ones?
– extensible?
– paradox: the more powerful the format, the less likely it will be used...

Page 81: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Metadata Formats and Implementation

Use markup languages
– Interoperable
– Extensible
– Robust

Permits advanced search features

When online, the beginning of a semantic web!

Page 82: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

What we covered

• Methods and tools for making sense of data

• Assists reasoning, decision making

• Data manipulation methods

• Large data

• How metadata helps

Page 83: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology
Page 84: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Importance of Data

“Data is not only important to science but also to the humanities.”

“The sexy job in the next ten years will be ... to take data -- to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it.” -- Hal Varian (Economist, Berkeley & Google)

“Elite American university students do not think big enough. That is exactly the complaint from some of the largest technology companies and the federal government. At the heart of this criticism is data.” -- New York Times

“Statistical agencies face increased demand for data products, and the questions asked by our society are becoming increasingly complex and hard to measure. Meeting these challenges requires innovation [in] cognitive research, and economic and statistical modeling.” -- Roderick Little (Statistician, US Census and U Michigan)

Page 85: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Never too much Data

“Companies that manage their data well are 5% to 6% more productive.” NYTimes

Page 86: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Building DBPedia

Page 87: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Words of wisdom

• "We have confused information (of which there is too much) with ideas (of which there are too few)."
– Paul Theroux

• "The great Information Age is really an explosion of non-information; it is an explosion of data ... it is imperative to distinguish between the two; information is that which leads to understanding."
– R.S. Wurman in his book: Information Anxiety 2

Page 88: Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology

Questions

• Role in information science of

• Information (knowledge) extraction

• Data mining

• Metadata

• What next?