A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon University

Dec 21, 2015

Page 1: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

A Century Of Progress On Information Integration:

A Mid-Term Report

William W. Cohen

Center for Automated Learning and Discovery (CALD), Carnegie Mellon University

Page 2:

Information Integration

• Discovering information sources (e.g. deep web modeling, schema learning, …)

• Gathering data (e.g., wrapper learning & information extraction, federated search, …)

Queries

• Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)

• Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …)

Linkage

• Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database

Page 3:

[Science 1959]

Record linkage: bringing together of two or more separately recorded pieces of information concerning a particular individual or family (Dunn, 1946; Marshall, 1947).

Page 4:
Page 5:

…Very much like inverse document frequency (IDF) rule used in information retrieval.

Page 6:

Motivations for Record Linkage c. 1959

Record linkage is motivated by certain problems faced by a small number of scientists doing data analysis for obscure reasons.

Page 7:

In 1954, Popular Mechanics showed its readers what a home computer might look like in 2004 …

Page 8:

Information integration in 1959

• Many of the basic principles of modern integration work are recognizable.
  – Fellegi and Sunter, "A theory for record linkage", Journal of the American Statistical Association, 1969
• Manual engineering of distance features (e.g., last names as Soundex codes) that are then matched probabilistically.
  – DB1 + DB2 DB12 + Pr(matches) + elbowGrease DB12
• Applied to records from pairs of datasets
  – “Smallest possible scale” for integration (on one dimension)
• Computationally expensive
  – Relative to ordinary database operations
• Narrowly used
  – Only for scientists in certain narrow areas (e.g., public health)
• Where have we come to now, in 2005?
  – [Hector’s heckling: “how do we know when we’re finished?”]

Page 9:

Ted Kennedy's “Airport Adventure” [2004]

Washington -- Sen. Edward "Ted" Kennedy said Thursday that he was stopped and questioned at airports on the East Coast five times in March because his name appeared on the government's secret "no-fly" list…Kennedy was stopped because the name "T. Kennedy" has been used as an alias by someone on the list of terrorist suspects.

“…privately they [FAA officials] acknowledged being embarrassed that it took the senator and his staff more than three weeks to get his name removed.”

Page 10:

Florida Felon List [2000,2004]

The purge of felons from voter rolls has been a thorny issue since the 2000 presidential election. A private company hired to identify ineligible voters before the election produced a list with scores of errors, and elections supervisors used it to remove voters without verifying its accuracy…

The new list … contained few people identified as Hispanic; of the nearly 48,000 people on the list created by the Florida Department of Law Enforcement, only 61 were classified as Hispanics.

Gov. Bush said the mistake occurred because two databases that were merged to form the disputed list were incompatible. … when voters register in Florida, they can identify themselves as Hispanic. But the potential felons database has no Hispanic category…

The glitch in a state that President Bush won by just 537 votes could have been significant — because of the state's sizable Cuban population, Hispanics in Florida have tended to vote Republican… The list had about 28,000 Democrats and around 9,500 Republicans…

Page 11:

Information dealing with such matters as violent crime, organized crime, fraud and other white-collar crime may take days to be shared throughout the law enforcement community, according to an FBI official.

The new software program was supposed to allow agents to pass along intelligence and criminal information in real time….

In a response contained in the inspector general's report, the FBI pointed to its Investigative Data Warehouse…that provides … access to 47 sources of counterterrorism data, including information from FBI files, other government agencies and open-source news feeds.

Page 12:

…counter asymmetric threats by achieving total information awareness…

Page 13:

Fishing in a sea of information

• Suppose you discover a pattern of events between a group of three people such that Pr(group is terroristCell | pattern) = 0.99999

• If you apply it to all three-person groups in the US, how many false positives will there be?

(250,000,000 * 50 * 49) * (1 – 0.99999) = 6,125,000

And how many true positives?
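The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: reading the factors 50 and 49 as "each person has 50 contacts, and a group is a person plus two distinct contacts" is an assumption, as is treating 1 – 0.99999 as a per-group error rate the way the slide does. The function name is made up for illustration.

```python
from fractions import Fraction

def expected_false_positives(population, contacts, confidence):
    """Evaluate the slide's product exactly (no floating-point rounding)."""
    # Candidate groups: one person plus an ordered pair of distinct contacts.
    groups = population * contacts * (contacts - 1)
    # Fraction("0.99999") parses the decimal string exactly.
    return groups, groups * (1 - Fraction(confidence))

groups, false_pos = expected_false_positives(250_000_000, 50, "0.99999")
print(f"{groups:,} candidate groups")
print(f"{int(false_pos):,} expected false positives")
```

Even a pattern that is right 99.999% of the time produces millions of false alarms at this scale, which is the point of the "fishing" warning.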

Page 14:

Chinese Embassy Bombing [1999]

• May 7, 1999: NATO bombs the Chinese Embassy in Belgrade with five precision-guided bombs, sent to the wrong address, killing three.

“The Chinese embassy was mistaken for the intended target…located just 200 yards from the embassy. Reliance on an outdated map, aerial photos, and the extrapolation of the address of the federal directorate from number patterns on surrounding streets were cited … as causing the tragic error…despite the elaborate system of checks built-into the targeting protocol, the coordinates did not trigger an alarm because the three databases used in the process all had the old address.” [US-China Policy Foundation summary of the investigation]

“BEIJING, June 17 –– China today publicly rejected the U.S. explanation … [and] said the U.S. report ‘does not hold water.’” [Washington Post]

“The Chinese embassy was clearly marked on tourist maps that are on sale internationally, including in the English language. … Its address is listed in the Belgrade telephone directory…. For the CIA to have made such an elementary blunder is simply not plausible.” [World Socialist Web Site]

“Many observers believe that the bombing was deliberate…if you believe that the bombing was an accident, you already believe in the far-fetched” [disinfo.com, July 2002].

Page 15:

Information integration in 2005

• Apparently, we still have work to do.
• We fail to integrate information correctly
  – “Ted Kennedy (D-MA)” ≠ “T. Kennedy, (T-IRA)”
• Crucial decisions are affected by these errors
  – Who can/can’t vote (felon list)
  – Where bombs are sent (Chinese embassy)
• Storing, linking, and analyzing information is a double-edged sword:
  – Loss of privacy and “fishing expeditions”

Page 16:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries
• Information integration for the great unwashed masses
  – Personal information, small stores of scientific information, …
• Using more information in linkage:
  – Text, images, multiple interacting “hard” sources
• Anonymous secure linkage; non-technical limitations of how information can be combined, as well as distributed.

Page 17:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries

– When does a particular user believe that “X is the same thing as Y”? Does “the same thing” always mean the same thing?

– Is “X is the same entity as Y” always transitive?

Page 18:

When are two entities the same?

• Bell Labs
• Bell Telephone Labs
• AT&T Bell Labs
• A&T Labs
• AT&T Labs-Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations

History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com]

Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]

[1925]

Page 19:

[Diagram: two “=” signs and the label “Bell Telephone Labs”]

Page 20:

“Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)…

King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? …

There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave

When are two entities the same?

Page 21:

Passing linkage decisions along to the user

Usual goal: link records and create a single highly accurate database for users to query.

• Equality is often uncertain, given available information about an entity

– “name: T. Kennedy occupation: terrorist”

• The interpretation of “equality” may change from user to user and application to application

– Does “Boston Market” = “McDonalds” ?

Alternate goal: wait for a query, then answer it, propagating uncertainty about linkage decisions on that query to the end user

Page 22:

Linkage

Queries

Traditional approach:

Uncertainty about what to link must be decided by the integration system, not the end user

Page 23:

WHIRL vision: link items as needed by Q

Query Q:

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a=S.a and S.b=T.b

R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
(strongest links: those agreeable to most users)
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
(weaker links: those agreeable to some users)
William  David   Cohen   Cohn
(even weaker links…)

Page 24:

WHIRL vision: link items as needed by Q

Query Q:

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b (~ TFIDF-similar)

R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
William  David   Cohen   Cohn

Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.

DB1 + DB2 ≠ DB

Page 25:

WHIRL queries

• Assume two relations:
  review(movieTitle,reviewText): archive of reviews
  listing(theatre, movieTitle, showTimes, …): now showing

review:
movieTitle | reviewText
The Hitchhiker’s Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay ….
Men in Black, 1997 | Will Smith does an excellent job in this …
Space Balls, 1987 | Only a die-hard Mel Brooks fan could claim to enjoy …
… | …

listing (columns in the order shown on the slide):
movieTitle | theatre | showTimes
Star Wars Episode III | The Senator Theater | 1:00, 4:15, & 7:30pm.
Cinderella Man | The Rotunda Cinema | 1:00, 4:30, & 7:30pm.
… | … | …

Page 26:

WHIRL queries

• “Find reviews of sci-fi comedies” [movie domain]
  FROM review as r SELECT * WHERE r.text~’sci fi comedy’
  (like standard ranked retrieval of “sci-fi comedy”)
• “Where is [that sci-fi comedy] playing?”
  FROM review as r, listing as s SELECT *
  WHERE r.title~s.title and r.text~’sci fi comedy’
  (best answers: titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)
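A similarity join of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WHIRL itself: the tokenizer, the exact TF-IDF weighting formula, and the function names are choices made here, and the sketch scores every pair up front rather than searching incrementally for the best answers.

```python
import math
import re
from collections import Counter
from itertools import product

def tokens(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(strings):
    """Unit-length TF-IDF vector (token -> weight) for each string."""
    docs = [Counter(tokens(s)) for s in strings]
    df = Counter(tok for d in docs for tok in d)
    n = len(docs)
    vectors = []
    for d in docs:
        v = {t: math.log(1 + tf) * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(left, right):
    """Return (score, left_item, right_item) triples, best matches first."""
    vectors = tfidf_vectors(left + right)
    lv, rv = vectors[:len(left)], vectors[len(left):]
    scored = [(cosine(a, b), l, r)
              for (l, a), (r, b) in product(zip(left, lv), zip(right, rv))]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])

reviews = ["The Hitchhiker's Guide to the Galaxy, 2005",
           "Men in Black, 1997", "Space Balls, 1987"]
listings = ["Star Wars Episode III",
            "Hitchhiker's Guide to the Galaxy", "Cinderella Man"]
ranked = soft_join(reviews, listings)
for score, review_title, listing_title in ranked:
    print(f"{score:.2f}  {review_title!r} ~ {listing_title!r}")
```

The caller decides how far down the ranked list to read, which mirrors the idea of leaving the linkage decision to the user or a downstream process.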

Page 27:

WHIRL queries

• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices….

The Hitchhiker’s Guide to the Galaxy, 2005

Men in Black, 1997

Space Balls, 1987

Star Wars Episode III

Hitchhiker’s Guide to the Galaxy

Cinderella Man

Page 28:

WHIRL queries

• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices….

The Hitchhiker’s Guide to the Galaxy, 2005

Men in Black, 1997

Space Balls, 1987

Star Wars Episode III

Hitchhiker’s Guide to the Galaxy

Cinderella Man

Years are common in the review archive, so have low weight

Inverted index:

hitchhiker → movie00137

the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …

- It is easy to find the (few) items that match on “important” terms

- Search for strong matches can prune “unimportant terms”
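The two points above can be illustrated with a small inverted index. This is a hedged sketch: the IDF cutoff `min_idf`, the function names, and the fourth title (added only so that "the" appears in more than one entry) are assumptions for illustration, not part of WHIRL.

```python
import math
import re
from collections import defaultdict

def tokens(text):
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def build_index(titles):
    """Map each token to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for tok in tokens(title):
            index[tok].add(doc_id)
    return index

def candidates(query, index, n_docs, min_idf):
    """Ids of titles sharing at least one high-IDF token with the query."""
    found = set()
    for tok in tokens(query):
        postings = index.get(tok, set())
        # Common tokens have long posting lists and low IDF; skip them.
        if postings and math.log(n_docs / len(postings)) >= min_idf:
            found |= postings
    return found

titles = ["The Hitchhiker's Guide to the Galaxy, 2005", "Men in Black, 1997",
          "Space Balls, 1987", "The Matrix, 1999"]  # last title is made up
index = build_index(titles)
query = "The Hitchhiker's Guide to the Galaxy"
strong = candidates(query, index, len(titles), min_idf=1.0)
loose = candidates(query, index, len(titles), min_idf=0.0)
print(strong, loose)
```

With the cutoff in place only the posting lists of rare tokens are read, so the common word "the" never pulls in unrelated titles.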

Page 29:

WHIRL results

• This sort of worked:
  – Interactive speeds (<0.3s/q) with a few hundred thousand tuples.
  – For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
  – Average precision better than 90% on 5-way joins

Page 30:

WHIRL and soft integration

• WHIRL worked for a number of web-based demo applications.
  – e.g., integrating data from 30-50 smallish web DBs with <1 FTE labor
• WHIRL could link many data types reasonably well, without engineering
• WHIRL generated numerous papers (Sigmod98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001)
• WHIRL was relational
  – But see ELIXIR (SIGIR2001)
• WHIRL users need to know schema of source DBs
• WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics
  – Text fields with few misspellimgs
• WHIRL was memory-based
  – all data must be centrally stored (no federated data)
  – small datasets only

Page 31:

WHIRL vision: very radical, everything was inter-dependent

Link items as needed by Q

Query Q:

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b (~ TFIDF-similar)

R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
William  David   Cohen   Cohn

Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.

Page 32:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries
• Information integration for the great unwashed masses
  – Personal information, small stores of scientific information, …
• Using more information in linkage:
  – Text, images, multiple interacting “hard” sources
• Anonymous secure linkage; non-technical limitations of how information can be combined, as well as distributed.

Page 33:

[Two bar charts, vertical axis from 0 to 250. First chart categories: Publish Book, Publish Website, Publish Blog. Second chart categories: Publish Corporate Data, Publish Scientific Data, Publish Personal Data.]

Page 34:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries
• Information integration for the great unwashed masses
  – Personal information, small stores of scientific information, …
• Needed:
  – Robust distance metrics that work “out of the box”
  – Methods to tune and combine these metrics

Page 35:

Robust distance metrics for strings

• Kinds of distances between s and t:
  – Edit-distance based (Levenshtein, Smith-Waterman, …): distance is cost of cheapest sequence of edits that transform s to t.
  – Term-based (TFIDF, Jaccard, DICE, …): distance based on set of words in s and t, usually weighting “important” words
  – Which methods work best when?
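The two families behave differently on the same inputs, which a minimal sketch makes concrete. The code below uses unit edit costs for Levenshtein and unweighted whitespace tokens for Jaccard; both are the simplest textbook variants, chosen here for illustration.

```python
def levenshtein(s, t):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]

def jaccard(s, t):
    """Overlap of the two word sets: |A & B| / |A | B|."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

print(levenshtein("Cohen", "Cohn"), jaccard("Cohen", "Cohn"))
print(levenshtein("William Cohen", "Cohen William"),
      jaccard("William Cohen", "Cohen William"))
```

Edit distance tolerates the misspelling "Cohn" but is thrown off by reordered words, while the term-based measure does the opposite.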

Page 36:

Robust distance metrics for strings

SecondString (Cohen, Ravikumar, Fienberg, IIWeb 2003):

• Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Used to experimentally compare a number of metrics

Page 37:

Results: Edit-distance variants

Monge-Elkan (a carefully-tuned Smith-Waterman variant) is the best on average across the benchmark datasets…

[Figure: 11-pt interpolated recall/precision curves averaged across 11 benchmark problems]

Page 38:

Results: Edit-distance variants

But Monge-Elkan is sometimes outperformed on specific datasets

[Figure: precision-recall for Monge-Elkan and one other method (Levenshtein) on a specific benchmark]

Page 39:

SoftTFIDF: A robust distance metric

We also compared edit-distance based and term-based methods, and evaluated a new “hybrid” method:

SoftTFIDF, for token sets S and T:

• Extends TFIDF by including pairs of words in S and T that “almost” match, i.e., that are highly similar according to a second distance metric (the Jaro-Winkler metric, an edit-distance like metric).
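The idea can be sketched as follows. Several things here are assumptions made for a short, self-contained example: `difflib.SequenceMatcher` stands in for Jaro-Winkler as the secondary metric, the threshold `theta` and the TF-IDF weighting formula are illustrative choices, and the function names are not SecondString's API.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def char_sim(a, b):
    # Stand-in for Jaro-Winkler (assumption): difflib's similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def weights(toks, df, n_docs):
    """Unit-length TF-IDF weights for one token list."""
    v = {w: math.log(1 + c) * math.log(1 + n_docs / df.get(w, 1))
         for w, c in Counter(toks).items()}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}

def soft_tfidf(s, t, corpus, theta=0.8):
    docs = [set(d.lower().split()) for d in corpus]
    df = Counter(w for d in docs for w in d)
    vs = weights(s.lower().split(), df, len(docs))
    vt = weights(t.lower().split(), df, len(docs))
    if not vt:
        return 0.0
    total = 0.0
    for w, weight_s in vs.items():
        # Closest token of T to w under the secondary metric.
        best_sim, best_v = max((char_sim(w, v), v) for v in vt)
        if best_sim >= theta:  # only "almost matching" pairs contribute
            total += weight_s * vt[best_v] * best_sim
    return total

corpus = ["William Cohen", "Will Cohn", "Steve Minton", "Steven Mitton"]
near = soft_tfidf("Steve Minton", "Steven Mitton", corpus)
far = soft_tfidf("Steve Minton", "Dan Weld", corpus)
print(round(near, 3), far)
```

Plain TFIDF would score "Steve Minton" against "Steven Mitton" as zero because no token is shared exactly; the soft variant credits the near-matching tokens.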

Page 40:

Comparing token-based, edit-distance, and hybrid distance metrics

SFS is a vanilla IDF weight on each token (circa 1959!)

Page 41:

SoftTFIDF is a Robust Distance Metric

Page 42:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries
• Information integration for the great unwashed masses
  – Personal information, small stores of scientific information, …
• Needed:
  – Robust distance metrics that work “out of the box”
  – Methods to tune and combine these metrics

Page 43:

Tuning and combining distance metrics using learning

• Why use a single distance metric?
  – Can you learn how to combine several distance metrics: either distances across several fields (name, address) or several ways of measuring distance on the same field (edit distance, TFIDF)?
    • [Bilenko & Mooney, KDD2003; Ravikumar, Cohen & Fienberg, UAI 2004]
  – Can you learn the (many) parameters of an edit distance metric? (e.g., what is the cost of replacing “M” with “N” vs “M” with “V”?)
    • [Ristad and Yianilos, PAMI’98; Bilenko & Mooney, KDD 2003]: learning edit distances using pair HMMs

Page 44:

Tuning and combining distance metrics using learning

• Pair HMM: a probabilistic automaton that randomly emits a pair of letters at each clock tick.
• A sequence of letter pairs corresponds to a pair of strings: COHEN, COH_M = COHEN, COHM
• The parameters of the HMM can be tuned with extensions of the standard learning methods for HMMs
  – Training data is pairs of “true matches”, e.g. Cohen/Cohm

A 1-state pair HMM, with emission probabilities:

(A,A)  0.002
(A,B)  0.000001
…      …
(M,M)  0.0012
(M,N)  0.0003
…      …
(M,X)  0.0000001
(M,_)  0.007
…      …

Example emitted sequence: (C,C),(O,O),(H,H),(E,_),(N,M),…
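A minimal sketch of how a 1-state pair HMM scores one alignment: the probability of a sequence of letter pairs is the product of their emission probabilities. Only the (A,A) and M-row values below come from the slide; the other entries, the `FLOOR` default, and the function names are made up for illustration, and the stopping probability is ignored.

```python
import math

# Emission table of a 1-state pair HMM; '_' is the gap symbol.
EMIT = {
    ("A", "A"): 0.002, ("M", "M"): 0.0012, ("M", "N"): 0.0003, ("M", "_"): 0.007,
    # Placeholder values (assumptions), not taken from the slide:
    ("C", "C"): 0.002, ("O", "O"): 0.002, ("H", "H"): 0.002,
    ("E", "_"): 0.007, ("N", "M"): 0.0003,
}
FLOOR = 1e-7  # probability assumed for any pair missing from EMIT

def strings_of(alignment):
    """Recover the two strings by dropping gap symbols."""
    s = "".join(a for a, _ in alignment if a != "_")
    t = "".join(b for _, b in alignment if b != "_")
    return s, t

def log_prob(alignment, emit=EMIT):
    """Log-probability of emitting exactly this sequence of letter pairs."""
    return sum(math.log(emit.get(pair, FLOOR)) for pair in alignment)

alignment = [("C", "C"), ("O", "O"), ("H", "H"), ("E", "_"), ("N", "M")]
unlikely = alignment[:-1] + [("N", "X")]
print(strings_of(alignment), round(log_prob(alignment), 2))
print(strings_of(unlikely), round(log_prob(unlikely), 2))
```

A learned string distance would sum over all alignments of the two strings (the forward algorithm) rather than scoring a single one; that step is omitted here.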

Page 45:

Tuning and combining distance metrics using learning

(A,A)  0.002
(A,B)  0.000001
…      …
(M,M)  0.0012
(M,N)  0.0003
…      …

Ristad & Yianilos, 98: Levenshtein-like edit distance

Bilenko & Mooney, 2003: Smith-Waterman-like “affine gap” edit distance

Page 46:

Tuning and combining distance metrics using learning

• Traditionally, multiple similarity measurements are combined using learning methods:
  – E.g., by clustering using a latent binary “Match?” variable
  – Independence assumptions are usually inappropriate: (e.g., same-address same-last-name)
• How to model dependencies?
  – Structural EM on a limited model (Ravikumar, Cohen, Fienberg, UAI2004)

[Diagram: latent match variable M with children F1, F2, F3, F4]

F1: JaroWinkler(name1,name2)

F2: Levenshtein(addr1,addr2)

Fk: SoftTFIDF(name1,name2)
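The traditional approach in the first bullet (a latent binary "Match?" variable with independent features) can be sketched with a small EM loop. This is the independence baseline, not the structural EM method of the cited paper. It also assumes each similarity score has already been thresholded into a 0/1 "agrees" feature; the initial values and the toy data are made up.

```python
def em_match(pairs, iters=50, eps=1e-6):
    """Unsupervised EM for a two-cluster model over binary agreement features."""
    k = len(pairs[0])
    p_match = 0.5
    m = [0.8] * k  # P(feature agrees | match)
    u = [0.2] * k  # P(feature agrees | non-match)
    clip = lambda p: min(max(p, eps), 1 - eps)
    post = []
    for _ in range(iters):
        # E-step: posterior probability that each candidate pair is a match.
        post = []
        for x in pairs:
            like_m, like_u = p_match, 1 - p_match
            for j, agrees in enumerate(x):
                like_m *= m[j] if agrees else 1 - m[j]
                like_u *= u[j] if agrees else 1 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate parameters from the soft assignments.
        total = sum(post)
        p_match = clip(total / len(pairs))
        for j in range(k):
            m[j] = clip(sum(g * x[j] for g, x in zip(post, pairs)) / max(total, eps))
            u[j] = clip(sum((1 - g) * x[j] for g, x in zip(post, pairs))
                        / max(len(pairs) - total, eps))
    return post, m, u

# Toy data: each tuple is (name agrees, address agrees, other field agrees).
pairs = [(1, 1, 1)] * 3 + [(1, 1, 0)] + [(0, 0, 0)] * 5 + [(0, 1, 0), (0, 0, 1)]
post, m, u = em_match(pairs)
print([round(p, 3) for p in post])
```

Because the features are treated as independent given the match variable, correlated fields such as address and last name are double-counted, which is the weakness the second bullet points out.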

Page 47:

Tuning and combining distance metrics using learning

• Traditionally, multiple similarity measurements are combined using learning methods:
  – E.g., by clustering using a latent binary “Match?” variable
  – Independence assumptions are usually inappropriate: (e.g., same-address same-last-name)
• How to model dependencies?
  – Structural EM on a limited model (Ravikumar, Cohen, Fienberg, UAI2004)

F1: JaroWinkler(name1,name2)

F2: Levenshtein(addr1,addr2)

Fk: SoftTFIDF(name1,name2)

[Diagram labels: M; latent binary variables X1, X2, X3, X4 (dependencies allowed); features F1, F2, F3, F4; monotone; fixed relation]

Page 48:

Tuning and combining distance metrics using unsupervised structural EM

Page 49:

Robust distance metrics, learnable using generative models (semi-/unsupervised learning)

• Summary:
  – None of these methods are evaluated on as many integration problems as one would like
  – Ravikumar et al structural EM method works well, but is computationally expensive
  – Pair-HMM methods of Ristad & Yianilos, Bilenko & Mooney work well, but require “true matched pairs”
  – Claim: could combine these by using pair-HMMs in inner loop of structural EM
    • Practical well before 2059

[Diagram labels: M; latent binary variables X1, X2, X3, X4 (dependencies allowed); features F1, F2, F3, F4; fixed relation]

Page 50:

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.
  – “Soft” information integration
  – Driven by user’s goal and user’s queries
• Information integration for the great unwashed masses
  – Personal information, small stores of scientific information, …
• Needed:
  – Robust distance metrics that work “out of the box”
  – Methods to tune and combine these metrics
  – Ways to rapidly integrate new information sources with unknown schemata

Page 51:

BANKS: Basic Data Model

• Database is modeled as a graph
  – Nodes = tuples
  – Edges = references between tuples
    • foreign key, inclusion dependencies, ..
    • Edges are directed.

[Diagram labels: paper “MultiQuery Optimization”; “writes” tuples; authors S. Sudarshan, Prasan Roy, Charuta]

BANKS: Keyword search…

User need not know organization of database to formulate queries.

Page 52:

BANKS: Answer to Query

Query: “sudarshan roy”

Answer: subtree from graph

[Diagram: paper “MultiQuery Optimization” connected through two “writes” tuples to authors S. Sudarshan and Prasan Roy]
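Keyword search over a tuple graph can be sketched as follows. This is a simplified illustration, not the BANKS algorithm: edge direction and edge weights are ignored, each keyword is matched to the first node whose text contains it, and the node ids and the edge from Charuta to the paper are assumptions made for the example.

```python
from collections import deque

def distances(graph, start):
    """Breadth-first distances and parent pointers from start."""
    dist, parent = {start: 0}, {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in sorted(graph[node]):
            if nxt not in dist:
                dist[nxt], parent[nxt] = dist[node] + 1, node
                queue.append(nxt)
    return dist, parent

def keyword_search(labels, edges, query):
    """Return (root, nodes) of a small tree connecting one hit per keyword."""
    graph = {n: set() for n in labels}
    for src, dst in edges:  # direction is ignored in this sketch
        graph[src].add(dst)
        graph[dst].add(src)
    hits = [next(n for n, text in labels.items() if kw in text.lower())
            for kw in query.lower().split()]
    searches = [distances(graph, h) for h in hits]
    reachable = [n for n in graph if all(n in d for d, _ in searches)]
    # Root: the node whose farthest keyword hit is nearest (ties: total, then id).
    root = min(reachable, key=lambda n: (max(d[n] for d, _ in searches),
                                         sum(d[n] for d, _ in searches), n))
    answer = set()
    for _, parent in searches:
        node = root
        while node is not None:  # walk from the root back to the keyword hit
            answer.add(node)
            node = parent[node]
    return root, answer

labels = {"paper1": "MultiQuery Optimization", "writes1": "writes",
          "writes2": "writes", "writes3": "writes", "author1": "S. Sudarshan",
          "author2": "Prasan Roy", "author3": "Charuta"}
edges = [("writes1", "paper1"), ("writes1", "author1"),
         ("writes2", "paper1"), ("writes2", "author2"),
         ("writes3", "paper1"), ("writes3", "author3")]
root, answer = keyword_search(labels, edges, "sudarshan roy")
print(root, sorted(answer))
```

The user supplies only keywords; the connecting tree through the "writes" tuples is discovered from the graph, so no knowledge of the schema is needed to pose the query.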

Page 53:

Other work on “structured search without schemata”

• BANKS (Browsing aNd Keyword Search): [Chakrabarti & others, VLDB 02, VLDB 05] – discussed above
• DataSpot (DTL)/Mercado Intuifind: [VLDB 98]
• Proximity Search: [VLDB98]
• Information units (linked Web pages): [WWW10]
• Microsoft DBExplorer, Microsoft English query

[Diagram labels: Queries; Linkage; Discovery, Gathering, Understanding]

Results expected before 2059:

• Combining BANKS-like search and WHIRL-like linkage: combining DBs without schema understanding or data cleaning
• Learning from user queries

Page 54:

• Given a group of strings that refer to some external entities, cluster them so that strings in the same cluster refer to the same entity.

Bart Selmann

B. Selman

Bart Selman

Cornell Univ.

Critical behavior in satisfiability

Critical behavior for satisfiability

BLACKBOX theorem proving

Using multiple relations: “database hardening”

Page 55:

Using multiple relations: “database hardening”

Database S1 (extracted from paper 1’s title page):

Database S2 (extracted from paper 2’s bibliography):

Assumption: identical strings from the same source are co-referent (one sense per discourse?)

Page 56:

So this gives some known matches, which might interact with proposed matches: e.g. here we deduce...

Using multiple relations: “database hardening”

Page 57:

“Soft database” from IE:

“Hard database” suitable for Oracle, MySQL, etc

Page 58: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

• (McAllester et al, KDD2000) Defined “hardening”:

  – Find “interpretation” (maps variant->name) that produces a compact version of “soft” database S.

  – Probabilistic interpretation of hardening:

    • Original “soft” data S is a version of latent “hard” data H.

    • Hardening finds max likelihood H.

  – Hardening is hard!

    • Optimal hardening is NP-hard.

  – Greedy algorithm:

    • naive implementation is quadratic in |S|

    • clever data structures make it O(n log n), where n=|S|d

• Other related work:

  – Pasula et al, NIPS2002: more explicit generative Bayesian formulation and MCMC method, experimental support

  – Wellner & McCallum 2004, Parag & Domingos 2004, Culotta & McCallum 2005: discriminative models for multiple-relation linkage
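To make the idea of a greedy hardening concrete, here is a toy sketch. It is not the KDD2000 algorithm: the cost function (one unit per distinct hard tuple plus a fixed charge per variant->name mapping), the candidate list, and all names are assumptions chosen for illustration. It corresponds to the naive version, which re-scores every candidate on every pass.

```python
def harden(soft_tuples, candidates, tuple_cost=1.0, merge_cost=0.5):
    """Greedily build an interpretation (variant -> name) that lowers an
    assumed cost: tuple_cost * |hard DB| + merge_cost * |interpretation|."""
    interp = {}

    def resolve(s):
        while s in interp:
            s = interp[s]
        return s

    def hard_db():
        return {tuple(resolve(field) for field in t) for t in soft_tuples}

    def cost():
        return tuple_cost * len(hard_db()) + merge_cost * len(interp)

    while True:
        best, best_cost = None, cost()
        for variant, name in candidates:
            # Skip variants already mapped and mappings that would form a cycle.
            if variant in interp or resolve(name) == variant:
                continue
            interp[variant] = name      # tentatively apply the mapping
            trial = cost()
            del interp[variant]         # undo it
            if trial < best_cost:
                best, best_cost = (variant, name), trial
        if best is None:
            return interp, hard_db()
        interp[best[0]] = best[1]


if __name__ == "__main__":
    soft = [
        ("Bart Selmann", "Cornell Univ."),
        ("Bart Selman", "Cornell Univ."),
        ("B. Selman", "BLACKBOX theorem proving"),
    ]
    cands = [("Bart Selmann", "Bart Selman"), ("B. Selman", "Bart Selman")]
    interpretation, hard = harden(soft, cands)
    print(interpretation)
    print(sorted(hard))
```

Under this assumed cost, a mapping is accepted only when it makes the database more compact. Here the first candidate collapses two tuples into one, while the second removes no tuple and is rejected.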

Using multiple relations: “database hardening”

Page 59: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Information integration, 2005-2059

• Understanding sources of uncertainty, and propagating uncertainty to the end user.

  – “Soft” information integration

  – Driven by user’s goal and user’s queries

• Information integration for the great unwashed masses

  – Personal information, small stores of scientific information, …

• Using more information in linkage:

  – Text, images, multiple interacting “hard” sources

Page 60: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Using free text: Information integration meets information extraction

Queries

Linkage

Traditionally, information extraction is viewed as “downstream” of integration—i.e., a separate process

When Craig Knoblock was here, he said that ISI was working on a way to automatically…

Craig Knoblock

ISI

Alon Halevey University of Washington

… …

Information extraction

Page 61: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Using free text: Information integration meets information extraction

Queries

Linkage

Question: Can information from other databases (to which the extracted data will be linked) be used to help IE?

This is especially important if we care a lot about what the joins will be…

When Craig Knowblock was here, he said that ISI was working on a Way to automatically…

Craig Knowblock ISI

Alon Halevey University of Washington

… …

Information extraction

Craig A. Knoblock FETCH

Alon Y. Halevey NIMBLE

Thomas A. Edison Edison Power & Lighting

Page 62: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Using free text: Information integration meets information extraction

When Craig Knowblock was here, he said that ISI was working on a Way to automatically…

Craig Knowblock

Alon Halevey

Craig A. Knoblock

Alon Y. Halevey

Thomas A. Edison

Simplified question: how much can NER (named entity recognition) be improved using a list of known names?

What if I just care about the JOIN of the two lists of names? I.e., the goal is to “join” the blue table with the orange text?

Page 63: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Information Extraction with HMM-like methods

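[Figure: HMM for segmenting an address. States: House number, Building, Road, City, State, Zip. Example emission probabilities shown for one state: hall 0.11, center 0.06, bldg 0.02, ...]

Example input: 4089 Whispering Pines Nobel Drive San Diego CA 92122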

Page 64: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

State of the art in named entity recognition: HMM-like sequential word classifiers

I saw Prof. F. Douglas at the San Diego zoo.

Person name: Craig Knoblock

Given a sequence of observations:

…and a trained HMM or CRF:

…find the most likely state sequence (Viterbi)

Any words determined to be generated by a “person name” state are extracted as a person name:

person name

location name

background

Prof. F. Douglas at the San Diego zoo
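The decoding step described on this slide can be sketched with a small hand-built HMM. All probabilities below are invented, unnormalized toy values, and the function names are hypothetical. A trained HMM or CRF would supply learned parameters, but the Viterbi recursion has the same shape.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-4):
    """Return the most likely state sequence for `words` (log-space Viterbi)."""
    def emit(state, word):
        return math.log(emit_p[state].get(word, unk))

    score = {s: math.log(start_p[s]) + emit(s, words[0]) for s in states}
    backpointers = []
    for word in words[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = prev[best] + math.log(trans_p[best][s]) + emit(s, word)
            ptr[s] = best
        backpointers.append(ptr)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract(words, labels, target):
    """Join maximal runs of words labeled `target` into extracted strings."""
    spans, current = [], []
    for word, label in zip(words, labels):
        if label == target:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Toy parameters (assumptions, not trained values).
STATES = ["background", "person", "location"]
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.3, "person": 0.6, "location": 0.1},
    "location": {"background": 0.3, "person": 0.1, "location": 0.6},
}
EMIT = {
    "background": {"I": 0.2, "saw": 0.2, "at": 0.2, "the": 0.2},
    "person": {"Prof.": 0.3, "F.": 0.2, "Douglas": 0.3},
    "location": {"San": 0.3, "Diego": 0.3, "zoo": 0.3},
}

if __name__ == "__main__":
    sentence = "I saw Prof. F. Douglas at the San Diego zoo".split()
    labels = viterbi(sentence, STATES, START, TRANS, EMIT)
    print(extract(sentence, labels, "person"))
    print(extract(sentence, labels, "location"))
```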

Page 65: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Features for information extraction

I met Prof. F. Douglas at the zoo

t: 1 2 3 4 5 6 7 8

x: I met Prof. F. Douglas at the zoo.

y: Other Other Person Person Person Other Other Location

Question: how can we guide this using a dictionary D?

Simple answer: make membership in D a feature fd
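A minimal sketch of the "simple answer": a per-token feature function in which dictionary membership is one more feature alongside the usual lexical ones. The feature names, the helper `build_token_dictionary`, and the choice to test individual lowercased tokens are assumptions for illustration.

```python
def build_token_dictionary(entries):
    """Collect the lowercased tokens of all dictionary entries (assumed design)."""
    return {token.lower() for entry in entries for token in entry.split()}


def token_features(words, i, dictionary):
    """Feature map for position i; `in_dictionary` plays the role of f_d."""
    word = words[i]
    previous = words[i - 1].lower() if i > 0 else "<s>"
    return {
        "word=" + word.lower(): 1.0,
        "prev_word=" + previous: 1.0,
        "is_capitalized": float(word[:1].isupper()),
        "in_dictionary": float(word.lower() in dictionary),
    }


if __name__ == "__main__":
    D = build_token_dictionary(["Craig A. Knoblock", "Alon Y. Halevey"])
    words = "When Craig Knowblock was here".split()
    for i in range(len(words)):
        print(words[i], token_features(words, i, D)["in_dictionary"])
```

With exact lookup, "Craig" fires the feature but the misspelled "Knowblock" does not.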

Page 66: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Existing Markov models for IE

• Feature vector for each position, computed from the i-th label, word i & neighbors, and the previous label

• Examples

• Parameters: weight W for each feature (vector)

Page 67: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Problem: Exact match to entry in D cannot handle abbreviations, misspellings, etc.

Proposed solution: Exploit state-of-the-art similarity metrics (edit distance, TFIDF match). Use similarity measures as feature values.

Problem: Single-word classification prevents effective use of multi-word entities in dictionaries.

Proposed solution: Classify multi-word segments. Use Semi-Markov models.
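The sketch below shows one way a similarity score could replace a 0/1 dictionary lookup as a feature value. It uses a plain Levenshtein edit distance normalized by the longer string's length. The normalization and the function names are assumptions, and a TFIDF match is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def dictionary_similarity(segment, dictionary):
    """Best normalized similarity in [0, 1] between `segment` and any entry."""
    def sim(entry):
        longest = max(len(segment), len(entry), 1)
        return 1.0 - edit_distance(segment.lower(), entry.lower()) / longest
    return max((sim(entry) for entry in dictionary), default=0.0)


if __name__ == "__main__":
    D = ["Craig A. Knoblock", "Alon Y. Halevey"]
    print(dictionary_similarity("Craig Knowblock", D))
    print(dictionary_similarity("Craig A. Knoblock", D))
```

Because the score is computed over a whole candidate segment such as "Craig Knowblock", it is most useful when the classifier labels multi-word segments and not single words.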

Page 68: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Semi-Markov models for information extraction

Word-level labeling:

t: 1 2 3 4 5 6 7 8

x: I met Prof. F. Douglas at the zoo.

y: Other Other Person Person Person Other Other Location

Segment-level labeling:

l,u: l1=u1=1; l2=u2=2; l3=3, u3=5; l4=u4=6; l5=u5=7; l6=u6=8

x: [I] [met] [Prof. F. Douglas] [at] [the] [zoo.]

y: Other Other Person Other Other Location

COST: Requires additional search in Viterbi. Learning and inference slower by a factor of O(maxNameLength).
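The extra search can be seen in a segment-level Viterbi sketch: for every end position the decoder also loops over candidate segment starts, up to a maximum length. The scoring function here is a hand-written toy, not a learned model, and the name lists and function names are assumptions. Segments are reported as 0-based, half-open `(start, end, label)` triples.

```python
def semi_markov_viterbi(words, labels, segment_score, max_len):
    """Best segmentation of `words` into labeled segments of length <= max_len."""
    n = len(words)
    # best[u][y]: best score for words[:u] when the last segment has label y.
    best = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for u in range(1, n + 1):
        for l in range(max(0, u - max_len), u):  # the additional loop
            if l == 0:
                previous = [(0.0, None)]
            else:
                previous = [(best[l][p], p) for p in labels]
            for y in labels:
                for prev_score, p in previous:
                    s = prev_score + segment_score(words, l, u, y, p)
                    if s > best[u][y]:
                        best[u][y] = s
                        back[u][y] = (l, p)
    # Trace back from the best label of the final segment.
    y = max(labels, key=lambda k: best[n][k])
    segments, u = [], n
    while u > 0:
        l, p = back[u][y]
        segments.append((l, u, y))
        u, y = l, p
    return segments[::-1]


# Toy dictionaries and scores (assumptions for illustration only).
PERSON_NAMES = {"Prof. F. Douglas"}
LOCATION_NAMES = {"zoo"}


def toy_segment_score(words, l, u, y, prev_y):
    text = " ".join(words[l:u])
    if y == "Other":
        return 0.0 if u - l == 1 else -10.0
    names = PERSON_NAMES if y == "Person" else LOCATION_NAMES
    return 2.0 * (u - l) if text in names else -5.0


if __name__ == "__main__":
    sentence = "I met Prof. F. Douglas at the zoo".split()
    for seg in semi_markov_viterbi(sentence, ["Other", "Person", "Location"],
                                   toy_segment_score, max_len=3):
        print(seg)
```

The loop over `l` multiplies the work per position by at most `max_len`, which matches the slowdown factor stated above.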

Page 69: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Features for Semi-Markov models

Feature vector for each segment Sj, computed from the j-th label, the start of Sj, the end of Sj, and the previous label

Page 70: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Internal dictionary: formed from training examples

External dictionary: from some column of an external DB

Page 71: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Using free text: Information integration meets information extraction

When Craig Knowblock was here, he said that ISI was working on a Way to automatically…

Craig Knowblock

Alon Halevey

Craig A. Knoblock

Alon Y. Halevey

Thomas A. Edison

Simplified question: how much can NER (named entity recognition) be improved using a list of known names? How well can we do at “joining to free text”?

Recent work:

• Kou et al, ISMB2005

• Cohen & Sarawagi, KDD2004

• Sarawagi & Cohen, NIPS2004

• Wellner et al, UAI2004

• Bunescu & Mooney, ACL2004

• Pasula et al, NIPS2003

• ….

Page 72: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Correspondence LDA

• Generative process

–Image first

–Caption second

• Caption topics are a subset of the image topics

• Enforces a correspondence between image segments and words that are associated with them.
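The "image first, caption second" process can be sketched as a sampler. The slide gives only the outline, so the details below are assumptions: a Dirichlet prior over topics, one-dimensional Gaussian region features (real region descriptors would be vectors), and a caption word generated by first picking one of the image's regions uniformly and reusing its topic. All function and parameter names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_image_and_caption(alpha, region_means, region_std, word_probs,
                               n_regions, n_words, seed=0):
    rng = random.Random(seed)
    theta = sample_dirichlet(alpha, rng)  # per-image topic proportions

    # Image first: a topic and a feature value for each region.
    region_topics = [sample_categorical(theta, rng) for _ in range(n_regions)]
    regions = [rng.gauss(region_means[k], region_std) for k in region_topics]

    # Caption second: each word picks a region and reuses that region's topic,
    # so caption topics are always a subset of the image topics.
    word_regions = [rng.randrange(n_regions) for _ in range(n_words)]
    words = [sample_categorical(word_probs[region_topics[r]], rng)
             for r in word_regions]
    return region_topics, regions, word_regions, words


if __name__ == "__main__":
    topics, regions, word_regions, words = generate_image_and_caption(
        alpha=[1.0, 1.0, 1.0],
        region_means=[0.0, 5.0, 10.0],
        region_std=1.0,
        word_probs=[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
        n_regions=4,
        n_words=3,
    )
    print(topics, word_regions, words)
```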

Page 73: A Century Of Progress On Information Integration: A Mid-Term Report William W. Cohen Center for Automated Learning and Discovery (CALD), Carnegie Mellon.

Information Integration: today and tomorrow

• Discovering information sources: based on standards and free-text metadata.

• Data providers will be even more numerous.

• Gathering data: will get cheaper and cheaper

Queries

• Querying integrated information sources may be done in radically different query models

• Data mining & analyzing integrated information will be the norm, not the exception

Linkage

• Cleaning data to form a single virtual database will be guided by a user or group of users, and by characteristics of all the data