Active Data: A Data-Centric Approach to Data Life-Cycle Management
Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)
(1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia
A. Simonet (Inria), Active Data (PDSW'13), November 18th, 2013
Outline
Introduction
  Data Life Cycle Management
  Use-case
  Requirements
Active Data
  Active Data: principles & features
  Example: Globus Online and iRODS
Discussion
  Advantages
  Limitations
Conclusion
  Related work
  Conclusion
Big Data
- Science and industry have become data-intensive
- The volume of data produced by science and industry grows exponentially
- How to store this deluge of data?
- How to extract knowledge and sense?
- How to make data valuable?
- Some examples
  - CERN's Large Hadron Collider: 1.5 PB/week
  - Large Synoptic Survey Telescope, Chile: 30 TB/night
  - Billion-edge social network graphs
  - Searching and mining the Web
Data Life Cycle
Data Life Cycle
- Creation/Acquisition
- Transfer
- Replication
- Disposal/Archiving
Definition
The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
Data Life Cycle Management
Complicated scenarios
- Execution of workflows
- Complex interactions between software systems
- Need to react quickly to operational events
Ad-hoc task-centric approaches
- Hard to program, maintain and debug
- No formal specification
- Complicate interactions between systems
Data Life Cycle Use-case
Example: the Advanced Photon Source at Argonne National Lab
- 100 TB of raw data per day
- Raw data are preprocessed and registered in a Globus dataset catalog
- Data are analyzed by various applications
- Results are stored in the dataset catalog and shared
[Figure: data flow of the use case. Components: Instrument (Beamline), Local Storage, Metadata Catalog, Remote Data Center, Academic Cluster. Operations: Transfer, Extract & Register Metadata, Transfer, Analysis, More analysis, Upload result, Register result metadata.]
Use-case
Task Centric
- Independent scripts
- Hard to program, maintain, verify
- Coarse granularity
Vs
Data Centric
- Express data dependencies
- Cross data-center coordination
- User-level fault-tolerance
- Incremental processing
Requirements
Challenges: a perfect system should...
- Simply represent the life cycle of data distributed across different data centers and systems
- Simplify data life cycle management (DLM) modeling and reasoning
- Hide the complexity resulting from using different infrastructures and systems
- Be easy to integrate with existing systems
Active Data principles
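System programmers expose their system's internal data life cycle with a model based on Petri Nets.
A life cycle model is made of Places and Transitions.
[Figure: Petri net life cycle model with places Created, Written, Read and Terminated, and transitions t1 to t4. A token (•) marks the current place of a data item and starts in Created.]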
public void handler() {
    computeMD5();
}
Each token has a unique identifier, corresponding to the identifier of the actual data item.
A transition is fired whenever a data state changes.
Clients may plug code into transitions. This code is executed whenever the transition is fired.
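The three principles above (places and transitions, tokens carrying the data item's identifier, client code attached to transitions) can be sketched in a few lines of Java. This is only an illustrative sketch: `LifeCycleModel`, `Handler` and every method name are hypothetical and are not the Active Data API, and the slide does not spell out which places each transition connects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Petri-net-like life cycle model; not the Active Data API.
class LifeCycleModel {
    // Client code attached to a transition.
    interface Handler { void handle(String tokenId); }

    // Transition name -> {source place, destination place}.
    private final Map<String, String[]> transitions = new HashMap<>();
    // Transition name -> handlers plugged in by clients.
    private final Map<String, List<Handler>> handlers = new HashMap<>();
    // Token identifier (the data item's identifier) -> current place.
    private final Map<String, String> tokenPlace = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new String[] {from, to});
    }

    // A new data item enters the model as a token in the given place.
    void createToken(String tokenId, String place) {
        tokenPlace.put(tokenId, place);
    }

    void subscribe(String transition, Handler h) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(h);
    }

    // Fire a transition for one token: move the token, then run the attached handlers.
    void fire(String transition, String tokenId) {
        String[] arc = transitions.get(transition);
        if (arc == null) {
            throw new IllegalArgumentException("unknown transition: " + transition);
        }
        if (!arc[0].equals(tokenPlace.get(tokenId))) {
            throw new IllegalStateException("token " + tokenId + " is not in place " + arc[0]);
        }
        tokenPlace.put(tokenId, arc[1]);
        for (Handler h : handlers.getOrDefault(transition, Collections.<Handler>emptyList())) {
            h.handle(tokenId);
        }
    }

    String placeOf(String tokenId) {
        return tokenPlace.get(tokenId);
    }
}
```

A client would, for instance, declare a transition from Created to Written (an assumed arc), subscribe a handler that computes a checksum, and let the system call `fire` when the data item is written.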
Active Data features
The Active Data programming model and runtime environment:
- Allows clients to react to life cycle progression
- Transparently exposes distributed data sets
- Can be integrated with existing systems
- Has scalable performance and minimum overhead over existing systems
Implementation
- Prototype implemented in Java (about 2,800 LOC)
- Client/Service communication is Publish/Subscribe
  - 2 types of subscription:
    - Every transition for a given data item
    - Every data item for a given transition
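The two subscription types listed above can be pictured as two lookup tables, one keyed by data item and one keyed by transition. The following is a minimal in-process sketch under that assumption; `SubscriptionRegistry`, `Listener` and the method names are hypothetical and do not come from the prototype, which uses a client/service architecture.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two subscription kinds; not the prototype's actual code.
class SubscriptionRegistry {
    interface Listener { void onTransition(String dataId, String transition); }

    private final Map<String, List<Listener>> byDataItem = new HashMap<>();
    private final Map<String, List<Listener>> byTransition = new HashMap<>();

    // Kind 1: every transition for a given data item.
    void subscribeToDataItem(String dataId, Listener l) {
        byDataItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(l);
    }

    // Kind 2: every data item for a given transition.
    void subscribeToTransition(String transition, Listener l) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(l);
    }

    // Called when a transition is published; notifies both kinds of subscribers
    // and returns how many listeners were notified.
    int publish(String dataId, String transition) {
        List<Listener> targets = new ArrayList<>();
        targets.addAll(byDataItem.getOrDefault(dataId, Collections.<Listener>emptyList()));
        targets.addAll(byTransition.getOrDefault(transition, Collections.<Listener>emptyList()));
        for (Listener l : targets) {
            l.onTransition(dataId, transition);
        }
        return targets.size();
    }
}
```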
- Several ways to publish transitions
  - Instrument the code
  - Read the logs
  - Rely on an existing notification system
- The service orders transitions by time of arrival
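Ordering by time of arrival can be illustrated with a service-side log that stamps each published transition with the next sequence number. This is a sketch under assumptions, not the prototype's implementation: `ArrivalOrderedLog`, `Entry` and their fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical service-side log; class, method and field names are assumptions.
class ArrivalOrderedLog {
    static final class Entry {
        final long seq;           // position in arrival order
        final String dataId;
        final String transition;

        Entry(long seq, String dataId, String transition) {
            this.seq = seq;
            this.dataId = dataId;
            this.transition = transition;
        }
    }

    private long nextSeq = 0;
    private final List<Entry> entries = new ArrayList<>();

    // Stamp each published transition on arrival; synchronized so that
    // concurrent publishers still get a single total order.
    synchronized Entry publish(String dataId, String transition) {
        Entry e = new Entry(nextSeq++, dataId, transition);
        entries.add(e);
        return e;
    }

    // Copy of the log in arrival order.
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```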
- Clients run transition handler code locally
- Transition handlers are executed
  - Serially
  - In a blocking way
  - In the order transitions were published
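One way to obtain the serial, ordered execution described above on the client side is a single worker thread fed by a FIFO queue, which is what `Executors.newSingleThreadExecutor()` provides in the Java standard library. The sketch below only illustrates that idea; `SerialHandlerRunner` and its methods are hypothetical, and the prototype may achieve the same guarantees differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical client-side dispatcher; names are illustrative only.
class SerialHandlerRunner {
    // One worker thread: handlers run one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Queue the handler for a notified transition; it starts only after
    // every previously queued handler has returned.
    void onNotify(Runnable handler) {
        worker.execute(handler);
    }

    // Stop accepting notifications and wait for the queued handlers to finish.
    // Returns false on timeout or if the waiting thread is interrupted.
    boolean drain(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A consequence of this design is that a slow handler delays every handler queued behind it, which matches the blocking behaviour listed on the slide.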
[Figure: clients and the Active Data Service. Clients subscribe to the service, clients publish transitions to it, and the service notifies the subscribed clients.]
Performance evaluation: Throughput
[Plot: transitions per second (y-axis, 5,000 to 35,000) against the number of clients (x-axis, 10 to 550).]
Figure: Average number of transitions per second handled by the Active Data Service
Clients publish 10,000 transitions in a row without pausing.
The prototype scales up to 30,000 transitions per second.
Example: Data Provenance
Definition
The complete history of data life cycle derivations and operations.
- Assess the quality of data
- Keep track of the origin of data over time
- Specialized Provenance Aware Storage Systems
  → What about heterogeneous systems?
Example with Globus Online (file transfer service) and iRODS (data store and metadata catalog)
Example: Globus Online and iRODS
Data events coming from Globus Online and iRODS
[Figure: two life cycle models, one per system.]
Globus Online
Places: Created, Succeeded, Failed, Terminated. Transitions: t1 to t4. A token (•) starts in Created.
Id: {GO: 7b9e02c4-925d-11e2}
public void handler () {
iput (...);
}
iRODS
Places: Created, Get, Put, Terminated. Transitions: t5 to t10.
Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
public void handler () {
annotate ();
}
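The composite identifier on this slide suggests how the two life cycles are linked: the token first carries only its Globus Online identifier, then gains an iRODS identifier once the file is stored there. The sketch below illustrates that idea with stubs. It makes no Globus Online or iRODS client calls, and it assumes (the slide does not say so explicitly) that `iput (...)` runs when the transfer succeeds and that `annotate ()` runs afterwards; `ProvenanceLink` and the `transferred-by` key are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the methods below are stubs, not Globus Online or iRODS calls.
class ProvenanceLink {
    // Composite identifier: one entry per system that knows the data item.
    final Map<String, String> ids = new LinkedHashMap<>();
    // Provenance annotations collected for the data item.
    final Map<String, String> annotations = new LinkedHashMap<>();

    ProvenanceLink(String globusTaskId) {
        ids.put("GO", globusTaskId);
    }

    // Stand-in for the handler that stores the transferred file in iRODS
    // (iput (...) on the slide) and records the identifier iRODS assigned.
    void onTransferSucceeded(String irodsObjectId) {
        ids.put("iRODS", irodsObjectId);
    }

    // Stand-in for the annotate () handler: attach the transfer's origin
    // to the object now known to iRODS.
    void annotate() {
        if (!ids.containsKey("iRODS")) {
            throw new IllegalStateException("data item is not yet stored in iRODS");
        }
        annotations.put("transferred-by", "GO:" + ids.get("GO"));
    }
}
```

With the identifiers shown on the slide, the link would start as {GO: 7b9e02c4-925d-11e2} and become {GO: 7b9e02c4-925d-11e2, iRODS: 10032} after the upload.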