http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Data Integration Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Data Integration - Georgia Institute of Technologypoloclub.gatech.edu/cse6242/2016fall/slides/CSE6242-4-Integrate.pdf · Data Integration Duen Horng ... Install Google Desktop and

Nov 03, 2019


http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Data Integration

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos


What is Data Integration? Why is it Important?


Data Integration

Combining data from different sources to provide the user with a unified view

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)


Example businesses that derive value via data integration


Craigslist now has map view! What problem has it solved?

https://atlanta.craigslist.org/search/hhh


More Examples

• Amazon (products): music/movie info

• Facebook: pages from Wikipedia (Google)

• mint.com: bank/credit card account info

• Flipboard / Google News / Apple News: articles

• …


How to do data integration?


“Low” Effort Approaches

1. Use database’s “Join”! (e.g., SQLite)
When does this approach work? (Or, when does it NOT work?)

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA

Joined on id:

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

2. Google Refine http://openrefine.org (video #3)
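The join in approach 1 can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not course material: the table names `names` and `states` and the extra row with id 444 are assumptions added to show one case where a plain join does NOT work, namely when an id has no match in the other table.

```python
import sqlite3


def join_names_and_states():
    """Join two small tables on their shared id column using in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
        CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
        INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
        INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
        -- Hypothetical extra row (not on the slide): no matching state record.
        INSERT INTO names  VALUES (444, 'Doe');
        """
    )
    # An inner join keeps only ids that appear in BOTH tables.
    rows = conn.execute(
        "SELECT n.id, n.name, s.state "
        "FROM names AS n JOIN states AS s ON n.id = s.id "
        "ORDER BY n.id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_names_and_states():
        print(row)
```

The row for id 444 silently drops out of the result, which is one answer to the question above: a join works only when both sources already share clean, matching IDs.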


So, it’s great to assign an ID to everything!

But how?


Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase intro: https://www.youtube.com/watch?v=TJfrNo3Z-DU
Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7


Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”
Wikipedia


So what? What can you do with Freebase?

Hint: Google acquired it in 2010


https://www.youtube.com/watch?v=mmQl6VGvX-c


https://www.facebook.com/about/graphsearch
https://www.youtube.com/watch?v=W3k1USQbq80


Feldspar
Finding Information by Association

Polo Chau, Brad Myers, Andrew Faulring

CHI 2008

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E


What to Do When Search Fails: Finding Information by Association Polo Chau, Brad Myers, Andrew Faulring 19

Feldspar


A system that helps people find things on their computers when typical search or browsing tools don’t work

An example scenario…


“Find the webpage mentioned in the email from the person I met at an event”


If I can’t remember the specifics, such as any text in the webpage, email, etc. → Can’t search

If I haven’t bookmarked the webpage → Can’t browse


But I can describe the webpage with a chain of associations.

webpage – email – person – event

The psychology literature has shown that people often remember things exactly like this.


Natural question: Can I find things by associations?


Can I find the webpage by specifying its associated information (email, person, and event)?


We created Feldspar, which supports this associative retrieval of information.


Feldspar stands for….


Finding Elements by Leveraging Diverse Sources of Pertinent Associative Recollection


DEMO
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E


Implementation: Overview

Create a graph database to store the associations among items on the computer

Develop an algorithm that processes the query and returns results


Creating an Association Database (a graph)

Install Google Desktop and let it index all the items on the computer


Focus on 7 types


Identify associations and build our database, which is a directed graph
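One way to picture the association database is as a typed, directed graph held in adjacency sets. The sketch below is illustrative only and is not Feldspar's actual storage layer: the class name `AssociationGraph`, the item ids, and the four item types used are assumptions drawn from the example scenario, not the 7 types the system focuses on.

```python
class AssociationGraph:
    """Directed graph: each node is an item with a type, each edge an association."""

    def __init__(self):
        self.node_type = {}   # item id -> type name
        self.out_edges = {}   # item id -> set of item ids it points to

    def add_item(self, item_id, item_type):
        self.node_type[item_id] = item_type
        self.out_edges.setdefault(item_id, set())

    def add_association(self, source, target):
        # Both endpoints must be registered first so every node has a type.
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("add both items before associating them")
        self.out_edges[source].add(target)

    def neighbors(self, item_id, item_type=None):
        """Items that item_id points to, optionally filtered by type."""
        return sorted(
            target
            for target in self.out_edges.get(item_id, ())
            if item_type is None or self.node_type[target] == item_type
        )


def build_example_graph():
    """Hypothetical items mirroring the chain: event, person, email, webpage."""
    graph = AssociationGraph()
    graph.add_item("event:mixer", "event")
    graph.add_item("person:cara", "person")
    graph.add_item("email:42", "email")
    graph.add_item("webpage:example", "webpage")
    graph.add_association("event:mixer", "person:cara")   # attended by
    graph.add_association("person:cara", "email:42")      # sent
    graph.add_association("email:42", "webpage:example")  # mentions
    return graph


if __name__ == "__main__":
    g = build_example_graph()
    print(g.neighbors("person:cara", "email"))
```

Because the graph is directed, an association that should be navigable in both directions would need an edge each way in this sketch.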


Algorithm that processes the query


Sample association graph database


Results Generator


One Results Generator for each pair of associations
For 7 data types → needs 7x7=49 results generators
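The idea of one results generator per ordered pair of types can be sketched as a lookup table of functions, with a query evaluated by walking a chain of types. This is an assumption-laden toy, not the algorithm from the Feldspar paper: `TYPES`, `EDGES`, `make_generator`, and `run_chain` are hypothetical names, and only 4 types are used, so the table has 4x4=16 entries rather than 49.

```python
from itertools import product

# Hypothetical subset of item types (the real system focuses on 7).
TYPES = ["event", "person", "email", "webpage"]

# Toy association data as (source item, target item) pairs.
EDGES = {
    ("event:mixer", "person:cara"),
    ("person:cara", "email:42"),
    ("email:42", "webpage:example"),
    ("person:spence", "email:43"),
}


def type_of(item_id):
    """Item ids in this sketch are written as 'type:name'."""
    return item_id.split(":", 1)[0]


def make_generator(from_type, to_type):
    """Build the results generator for one ordered (from_type, to_type) pair."""
    def generate(items):
        return {
            target
            for source, target in EDGES
            if source in items
            and type_of(source) == from_type
            and type_of(target) == to_type
        }
    return generate


# One generator per ordered pair of types: len(TYPES) ** 2 entries.
GENERATORS = {
    (a, b): make_generator(a, b) for a, b in product(TYPES, repeat=2)
}


def run_chain(start_items, type_chain):
    """Follow a chain of types, applying the matching generator at each hop."""
    current = set(start_items)
    for from_type, to_type in zip(type_chain, type_chain[1:]):
        current = GENERATORS[(from_type, to_type)](current)
    return current


if __name__ == "__main__":
    chain = ["event", "person", "email", "webpage"]
    print(run_chain({"event:mixer"}, chain))
```

Keying the table on type pairs keeps each generator small and specific, at the cost of the quadratic growth noted above.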


Usability Study: Questions to Answer

Do people understand how to use Feldspar?

Does Feldspar perform well on complicated tasks (which are difficult if not using Feldspar)?

Does Feldspar also work well for simple everyday tasks?


Usability Study: Study Design


Within-subject design

Two groups of software

• Feldspar
• Control: conventional desktop applications, including Outlook and its built-in browsing and querying mechanisms, Google Desktop, and the Windows Explorer

Two similar task sets

• Task set A and B
• Each set has 7 tasks
• Tasks ask people to find things on a computer


Usability Study: Study Design & Participants


Four conditions, counter-balanced for software order and task set order

(Feldspar + Task Set A) then (Control + Task Set B)
(Feldspar + Task Set B) then (Control + Task Set A)
(Control + Task Set A) then (Feldspar + Task Set B)
(Control + Task Set B) then (Feldspar + Task Set A)


8 participants, two in each condition
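The four counter-balanced conditions can be enumerated mechanically, as in the short sketch below. The participant labels P1 to P8 and the round-robin assignment are assumptions made for illustration; the slides do not say how participants were actually assigned to conditions.

```python
from itertools import product

SOFTWARE = ["Feldspar", "Control"]
TASK_SETS = ["A", "B"]


def build_conditions():
    """Each condition is (first session, second session), with software and
    task set both switched between the two sessions."""
    conditions = []
    for first_sw, first_set in product(SOFTWARE, TASK_SETS):
        second_sw = SOFTWARE[1 - SOFTWARE.index(first_sw)]
        second_set = TASK_SETS[1 - TASK_SETS.index(first_set)]
        conditions.append(((first_sw, first_set), (second_sw, second_set)))
    return conditions


def assign_round_robin(participants, conditions):
    """Assumed assignment scheme: cycle through the conditions in order."""
    return {
        participant: conditions[index % len(conditions)]
        for index, participant in enumerate(participants)
    }


if __name__ == "__main__":
    conditions = build_conditions()
    participants = [f"P{n}" for n in range(1, 9)]  # hypothetical labels
    for participant, condition in assign_round_robin(participants, conditions).items():
        print(participant, condition)
```

With 8 participants and 4 conditions, cycling through the list gives exactly two participants per condition.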


User Study: Task Set A

1. Open the last email received on July 27, 2007

2. Open all the email attachments of type .txt

3. Find out who had email conversations with the person who sent out the file file.doc

4. Find out who attended the event in which Cara was present.

5. Find all the events that were attended by anyone who has sent you a file.

6. Open the file folders that contain email attachments from Spence.

7. Open the webpage mentioned in the email from the person you met in an event in May.

Details


Quantitative Results

Average completion time for each task (in seconds)

Number of fails for each task

Statistically significant tasks marked by * (p<0.05)
Shorter bars are better


Qualitative Results


Conclusions

Feldspar

• Provides a usable interface for finding desktop information by interactively and incrementally specifying multiple levels of associations;
• Builds a graph database for storing associations, identified from Google Desktop;
• Uses real-time algorithms to provide answers to queries.

Feldspar could be a useful addition to conventional search and browsing tools.


What if we don’t have the luxury of having IDs?

(Screenshot from Freebase video)

A common problem in academia:
Polo Chau
Duen Horng Chau
Duen Chau
D. Chau


Then you need to do…

Entity Resolution
(A hard problem in data integration)


Why is entity resolution important?

Case Study: Let’s shop for an iPhone 6 on Apple, Amazon and eBay


D-Dupe
Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman

http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)
http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf


[Figure labels: Polo, Paolo, Alice, Bob, Carol, Dave]


Numerous similarity functions

• Euclidean distance
  Euclidean norm / L2 norm

• TaxiCab/Manhattan distance

• Jaccard Similarity (e.g., used with w-shingles)
  e.g., overlap of nodes’ #neighbors

• String edit distance
  e.g., “Polo Chau” vs “Polo Chan”

• Canberra distance (compare ranked items)

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
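For reference, the measures listed above can each be written in a few lines of plain Python. These are textbook definitions sketched for illustration, not code from the course or from D-Dupe; the function names are chosen here, and treating two empty sets as fully similar in `jaccard` is a convention picked for this sketch.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Taxicab (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|), skipping coordinates where both are 0."""
    total = 0.0
    for x, y in zip(a, b):
        denominator = abs(x) + abs(y)
        if denominator:
            total += abs(x - y) / denominator
    return total


def jaccard(s, t):
    """Size of the intersection over size of the union of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen for this sketch
    return len(s & t) / len(s | t)


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete from s
                current[j - 1] + 1,      # insert into s
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("Polo Chau", "Polo Chan"))
    print(jaccard({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"}))
```

Note that the first three and the last are distances (smaller means more alike), while Jaccard is a similarity (larger means more alike), so they need to be put on a common scale before being combined.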


https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html


Core components: Similarity functions

Determine how two entities are similar.
D-Dupe’s approach: Attribute similarity + relational similarity

Similarity score for a pair of entities


Attribute similarity (a weighted sum)
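A weighted-sum attribute similarity, combined with a relational score, might look like the sketch below. This is not D-Dupe's actual formula: the per-attribute measure (difflib's `SequenceMatcher` ratio), the attribute names, the weights, the use of Jaccard overlap for relational similarity, the mixing parameter `alpha`, and the co-author sets are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(record_a, record_b, weights):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * string_similarity(record_a[attr], record_b[attr])
        for attr, weight in weights.items()
    )
    return weighted / total_weight


def relational_similarity(neighbors_a, neighbors_b):
    """Assumed relational measure: Jaccard overlap of the two neighbor sets."""
    neighbors_a, neighbors_b = set(neighbors_a), set(neighbors_b)
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


def combined_similarity(attr_score, rel_score, alpha=0.5):
    """Blend the two scores; alpha is a hypothetical mixing weight."""
    return alpha * attr_score + (1 - alpha) * rel_score


if __name__ == "__main__":
    # Hypothetical records and co-author sets, reusing names from the figure.
    polo = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
    paolo = {"name": "Paolo Chau", "affiliation": "Georgia Tech"}
    weights = {"name": 2, "affiliation": 1}

    attr = attribute_similarity(polo, paolo, weights)
    rel = relational_similarity({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"})
    print(attr, rel, combined_similarity(attr, rel))
```

Normalizing by the total weight keeps the attribute score in [0, 1], so it can be blended with the relational score on the same scale.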