Effective Management and Exploration of Scientific Data on the Web. Lena Strömbäck lena.stromback@liu.se Linköping University
Apr 29, 2018

Page 1: Effective Management and Exploration of Scientific … Management and Exploration of Scientific Data on the Web. Lena Strömbäck lena.stromback@liu.se Linköping University 2 Internet

Effective Management and Exploration of Scientific Data on the Web.

Lena Strömbäck

lena.stromback@liu.se

Linköping University

2

Internet

3

Example: New York Times

4

Example: Baby Name Wizard

Laura Wattenberg – Generation Grownup

How is it used?

What are the problems?

5

Example: Many Eyes

IBM Research and the IBM Cognos software group

How is it used?

What are the problems?

6

E-Science data

Complex data

Not easily human interpretable

Need for integration and comparison

Powerful computation needed

7

To further complicate the task

Standardization and agreement on common formats are prerequisites for efficient data management

The Web is an ad-hoc platform where new data formats and actors appear all the time

8

Content of this presentation:

Two scientific application areas

Provenance/Scientific workflows

Bioinformatics

Three different aspects

Interfaces for exploration

Seamless data integration

Effective data exploration

9

Content of this presentation:

Two scientific application areas

Provenance/Scientific workflows

Bioinformatics

Three different aspects

Interfaces for exploration

Seamless data integration

Effective data exploration

10

Biological data

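[Figure: data about INSULIN spread over several sources. Data types shown: DNA sequence, protein sequence, secondary structure, tertiary structure, signaling pathway, taxonomy. Databases shown: PDB, GenBank, AmiGO, SPAD, PROSITE, SWISS-PROT.]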

11

Capturing provenance

Provenance of scientific artifacts is necessary to reproduce, validate and share scientific results

Provenance can be as important as the results!

12

Scientific workflows and provenance – capturing biological data integration

13

14

Scientific workflows

Advantage of workflows

Easy to edit

Reusable

Sharable

Reusing workflows

Large collections have become available

How to take advantage of this information?

Finding specific workflows

Workflow Search Engines

Workflow Query Languages

15

Content of this presentation:

Two scientific application areas

Provenance/Scientific workflows

Bioinformatics

Three different aspects

Interfaces for exploration

Seamless data integration

Effective data exploration

16

June 17, 2009

Issues in workflow search

• Different types of search methods

– Keywords

– Structured queries – workflow query language

– Workflow similarity clustering

• Capturing the user intent

• How to rank results

– Calculate most relevant workflow from a user query

• How to display results

– Workflow snippets, descriptions, thumbnails

17

Workflow snippets – state of the art

• Emphasis on meta-data

• Low quality when information is insufficient or absent

18

Important features

19

Requirements for snippets

• Self-contained

– A snippet should contain the context of a keyword

• Representative

– The user should be able to grasp the essence of the result from its snippet.

• Distinguishable

– The snippet should make the corresponding query result distinguishable from other results

• Small

– A snippet should be small so that it is easy to browse several results

Huang, Liu and Chen (2008) Query biased snippet generation in XML search. SIGMOD 2008.

20

Requirements for workflow snippets

• Self-contained

– If a keyword matches a module, its parameters or its annotation, then that module should be included in the snippet.

• Representative

– Include the modules that represent the most prominent features of the workflow.

• Distinguishable

– Find and display the structural differences among the workflows

• Small

– Show at most g modules

21

Selection strategy 1: Query neighborhood

• Identify the most important modules in the neighborhood of modules matching the keywords.

• Algorithm:

1. Choose the modules matching the keywords

2. Traverse the neighborhood to find the closest modules with the highest IDF values
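
The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the workflow is assumed to be an undirected adjacency map from module name to neighbor names, IDF values are assumed to be precomputed, and the function name `query_neighborhood_snippet`, the breadth-first traversal, the tie-breaking rule and the module names are all hypothetical.

```python
from collections import deque


def query_neighborhood_snippet(graph, idf, keywords, max_modules):
    """Pick up to max_modules modules: keyword matches first, then near neighbors.

    graph: dict mapping module name -> iterable of neighboring module names
    idf: dict mapping module name -> IDF value (assumed precomputed)
    """
    # Step 1: modules whose name contains any keyword (case-insensitive).
    lowered = [k.lower() for k in keywords]
    matches = [m for m in graph if any(k in m.lower() for k in lowered)]

    # Step 2: breadth-first traversal records each module's distance
    # to the closest matching module.
    distance = {m: 0 for m in matches}
    queue = deque(matches)
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)

    # Closest modules first; among equally close ones, highest IDF first.
    candidates = [m for m in distance if m not in matches]
    candidates.sort(key=lambda m: (distance[m], -idf.get(m, 0.0), m))
    return (matches + candidates)[:max_modules]


# Hypothetical four-module pipeline; names and IDF values are made up.
workflow = {
    "FileReader": ["Isosurface"],
    "Isosurface": ["FileReader", "Mapper"],
    "Mapper": ["Isosurface", "Renderer"],
    "Renderer": ["Mapper"],
}
idf_values = {"FileReader": 0.2, "Isosurface": 1.5, "Mapper": 0.4, "Renderer": 0.1}
snippet = query_neighborhood_snippet(workflow, idf_values, ["isosurface"], 3)
```

Here the keyword match is kept first, and the remaining slots go to its direct neighbors, ordered by IDF. Matching only on module names is a simplification; the slides also mention parameters and annotations.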

22

Selection strategy 2: IDF

• Find a set of representative modules by choosing those with the highest IDF values.
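
A minimal sketch of this strategy, assuming the standard inverse document frequency definition log(N / df) with each workflow treated as a "document". The slides do not say which IDF variant was used, so the formula, the function names and the example modules below are assumptions.

```python
import math


def module_idf(workflows):
    """Return IDF per module: log(N / number of workflows containing it)."""
    total = len(workflows)
    document_frequency = {}
    for modules in workflows:
        for module in set(modules):  # count each module once per workflow
            document_frequency[module] = document_frequency.get(module, 0) + 1
    return {m: math.log(total / df) for m, df in document_frequency.items()}


def idf_snippet(workflow_modules, idf, max_modules):
    """Choose the max_modules distinct modules with the highest IDF values."""
    distinct = sorted(set(workflow_modules))  # alphabetical order breaks ties
    distinct.sort(key=lambda m: idf[m], reverse=True)
    return distinct[:max_modules]


# Hypothetical collection of three workflows, each a list of module names.
collection = [
    ["FileReader", "Isosurface", "Renderer"],
    ["FileReader", "VolumeMapper", "Renderer"],
    ["FileReader", "Histogram"],
]
idf = module_idf(collection)
top = idf_snippet(collection[0], idf, 2)
```

A module that occurs in every workflow, such as the hypothetical `FileReader`, gets an IDF of zero and is therefore the last to be chosen.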

23

Selection strategy 3: Grouping

• Find co-occurring modules, as they correspond to a specific functionality or semantic entity.

• Jaccard distance:

$$\mathrm{MScore}(M_n) = \frac{\sum_{m_i, m_j \in M_n} \mathrm{dist}(m_i, m_j)}{|M_n|}$$

$$\mathrm{GScore}(G) = \sum_{M_i \in G} \mathrm{MScore}(M_i)$$

• Problem: NP-complete, so we use a greedy version.
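
One simple greedy variant can be sketched as follows. The slides do not spell out the greedy algorithm, so everything here is an assumption made for illustration: each module is represented by the set of workflows it occurs in, the Jaccard distance is the standard 1 - |A ∩ B| / |A ∪ B|, and a module joins the first group whose members are all within a hypothetical `threshold`.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_groups(workflows, threshold):
    """Greedily group modules that tend to occur in the same workflows."""
    # Map each module to the set of workflow indices in which it occurs.
    occurrences = {}
    for index, modules in enumerate(workflows):
        for module in modules:
            occurrences.setdefault(module, set()).add(index)

    groups = []
    for module in sorted(occurrences):
        for group in groups:
            # Join the first group whose members are all within the threshold.
            if all(
                jaccard_distance(occurrences[module], occurrences[other]) <= threshold
                for other in group
            ):
                group.append(module)
                break
        else:
            groups.append([module])
    return groups


# Hypothetical collection; module names are made up for the example.
collection = [
    ["FileReader", "Isosurface", "Mapper", "Renderer"],
    ["FileReader", "Isosurface", "Mapper"],
    ["FileReader", "Histogram", "Plot"],
    ["Histogram", "Plot"],
]
groups = greedy_groups(collection, threshold=0.4)
```

Because the pass is greedy, the result depends on the order in which modules are visited (alphabetical here). That is the price paid for avoiding the exhaustive search.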

24

Evaluation: Important modules – compared to strategies

Choose the six most important modules in the workflow.

25

Selection strategy 4: Difference highlighting

• Display differences and similarities among workflows in a result set

• Identify the most prominent differences
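
As a rough illustration, differences and similarities in a result set can be approximated by comparing module sets. This sketch ignores the connections between modules, so it captures only part of what the slides call structural differences; the function name, the output layout and the example workflows are assumptions.

```python
def difference_snippets(result_set):
    """Split each workflow's modules into shared and distinguishing modules.

    result_set: dict mapping workflow name -> list of module names
    """
    module_sets = {name: set(modules) for name, modules in result_set.items()}
    # Modules present in every workflow of the result set are the similarities.
    shared = set.intersection(*module_sets.values()) if module_sets else set()
    return {
        name: {"shared": sorted(shared), "different": sorted(modules - shared)}
        for name, modules in module_sets.items()
    }


# Hypothetical result set with two workflows.
results = {
    "workflow_a": ["FileReader", "Isosurface", "Renderer"],
    "workflow_b": ["FileReader", "VolumeMapper", "Renderer"],
}
highlighted = difference_snippets(results)
```

A presentation layer could then emphasize the `different` entries and de-emphasize the `shared` ones.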

26

Snippet presentation

Independent of selection strategy, there are several options for presentation

– Text-based

– Dynamic image

– Legend

27

Evaluation: Important features

Part 3: Score workflow snippets

28

Scientific workflows for exploring Bioinformatics Web sources

29

BioSpider

30

BioSpider

31

Content of this presentation:

Two scientific application areas

Provenance/Scientific workflows

Bioinformatics

Three different aspects

Interfaces for exploration

Seamless data integration

Effective data exploration

32

Seamless data integration

The BioSpider allows:

Easy integration of data from various web sources

Tracking of data provenance

Little programming knowledge required of the end user

However,

Each new object type (database) must be added as a new module

Adding a module requires substantial programming skills

How can we improve?

33

Seamless data integration

Using available resources

MIRIAM

34

Seamless data integration

Using available resources

MIRIAM

BioCatalogue

35

Seamless data integration

Using available resources

MIRIAM

BioCatalogue

Allowing users to add new methods and knowledge
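
To illustrate why a shared identifier scheme helps here, the sketch below turns a MIRIAM-style URN into a resolvable identifiers.org URL, so that one generic routine could address entries in many databases. It assumes the URN form `urn:miriam:<collection>:<identifier>` and the identifiers.org compact form `<collection>:<identifier>`. The function name is hypothetical, nothing is fetched or validated, and this is not how BioSpider is known to work.

```python
def miriam_urn_to_url(urn):
    """Convert 'urn:miriam:<collection>:<id>' to an identifiers.org URL."""
    prefix = "urn:miriam:"
    if not urn.startswith(prefix):
        raise ValueError(f"not a MIRIAM URN: {urn!r}")
    # Split only at the first colon so the identifier part is kept whole.
    collection, _, identifier = urn[len(prefix):].partition(":")
    if not collection or not identifier:
        raise ValueError(f"missing collection or identifier: {urn!r}")
    return f"https://identifiers.org/{collection}:{identifier}"


# P01308 is the UniProt accession for human insulin.
url = miriam_urn_to_url("urn:miriam:uniprot:P01308")
```

With such a mapping, supporting a new database could become a matter of knowing its collection name rather than writing a new module, which is the kind of improvement the slide asks about.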

36

Content of this presentation:

Two scientific application areas

Provenance/Scientific workflows

Bioinformatics

Three different aspects

Interfaces for exploration

Seamless data integration

Effective data exploration

37

Effective data exploration

Complex data structure – often graph structure

Need for effective exploration methods

Data often represented as XML or RDF

38

Hybrid XML Storage

[Diagram: two storage architectures. Native: XML data and XQuery in, XML results out. Hybrid: XML data and SQL/XQuery in, passed through a mapping layer to a hybrid DBMS, XML/relational results out.]

39

Efficiency: Increasing query complexity

[Chart: vertical axis with ticks at 0, 2000 and 4000 (unit not given); query types Species, Path, Path (2 step), Path (3 step), Path (4 step); series Native, Designed shredding, Automatic shredding.]

40

Tool development: HShreX

41

Working with HShreX:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:shrex="http://www.cse.ogi.edu/shrex">
  <xs:element name="families">
    <xs:complexType>
      <xs:sequence maxOccurs="unbounded">
        <xs:element name="family" type="familyType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="familyType">
    <xs:sequence>
      <xs:element name="parent" type="parentType"/>
      <xs:element name="child" type="childType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="parentType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="job" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="childType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="school" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

Resulting relational tables:

Families
Id  Pid
0   -

Families_family
Id  Pid
1   0

Families_family_parent
Id  Pid  Name  Job
2   1    Lena  Lektor

Families_family_child
Id  Pid  Name    School
3   1    Ludvig  Skolan

42

Working with HShreX:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:shrex="http://www.cse.ogi.edu/shrex">
  <xs:element name="families">
    <xs:complexType>
      <xs:sequence maxOccurs="unbounded">
        <xs:element name="family" type="familyType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="familyType">
    <xs:sequence>
      <xs:element name="parent" type="parentType"/>
      <xs:element name="child" type="childType" shrex:maptoxml="true"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="parentType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="job" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="childType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="school" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>

Resulting relational tables:

Families
Id  Pid
0   -

Families_family
Id  Pid  Child
1   0    <child><name>Ludvig</name><school>Skolan</school></child>

Families_family_parent
Id  Pid  Name  Job
2   1    Lena  Lektor
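
The difference between full shredding and the hybrid mapping can be sketched in Python. This is not HShreX: it does not read the schema or its `shrex:maptoxml` annotation. Instead, a hypothetical `keep_as_xml` argument names the elements to store as XML, tables are plain lists of dictionaries, and column names keep the lowercase element names.

```python
import xml.etree.ElementTree as ET


def shred(xml_text, keep_as_xml=()):
    """Shred XML into {table name: [row dict]}, keeping some elements as XML."""
    tables = {}
    next_id = 0

    def visit(element, parent_id, path):
        nonlocal next_id
        row = {"Id": next_id, "Pid": parent_id}
        next_id += 1
        tables.setdefault("_".join(path), []).append(row)
        for child in element:
            if child.tag in keep_as_xml:
                # Hybrid case: store the whole subtree as one XML-typed column.
                row[child.tag] = ET.tostring(child, encoding="unicode").strip()
            elif len(child) == 0:
                # Simple element: becomes a column of the current row.
                row[child.tag] = (child.text or "").strip()
            else:
                # Complex element: gets its own table, linked through Pid.
                visit(child, row["Id"], path + [child.tag])

    root = ET.fromstring(xml_text)
    visit(root, None, [root.tag.capitalize()])
    return tables


DOCUMENT = """
<families>
  <family>
    <parent><name>Lena</name><job>Lektor</job></parent>
    <child><name>Ludvig</name><school>Skolan</school></child>
  </family>
</families>
"""

fully_shredded = shred(DOCUMENT)
hybrid = shred(DOCUMENT, keep_as_xml={"child"})
```

With no arguments the sketch produces four tables with the same Id and Pid values as on the first HShreX slide. With `keep_as_xml={"child"}` the child table disappears and the `<child>` subtree becomes a column of `Families_family`, as on the second slide.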

43

Guidelines for Shredding XML:

Keep together what naturally belongs together

Do not shred parts of the XML where the schema allows large variation

Take variations of the actual data into account

Shred elements that are critical for performance

Prefer the representation that is required for query results

44

Efficiency for graph queries

[Chart: vertical axis with ticks at 0, 2000 and 4000 (unit not given); query types Species, Path, Path (2 step), Path (3 step), Path (4 step); series Native, Designed shredding, Automatic shredding.]

45

Effective querying for workflows

Tool independent

capture all features of OPM

Complex queries on

structure,

versions,

subworkflow

similarity

Infrastructure for evaluation

46

Collaborators

Bioinformatics standards: Patrick Lambrix, He Tan

Workflow snippets: Tommy Ellkvist, Juliana Freire, Lauro Didier Linz

BioSpider: Mikael Åsberg, Rickard Pettersson

HShreX and hybrid storage: Mikael Åsberg, David Hall, Valentina Ivanova, Juliana Freire

Efficient storage for workflows: Valentina Ivanova, Juliana Freire

Thanks!

www.liu.se