Tutorials on Data Management

Feb 17, 2016

kylene

Transcript
Page 1: Tutorials on Data Management

Analysis and Workflows

Tutorials on Data Management
Lesson 12: Analysis and Workflows

CC image by Marc_Smith on Flickr

Page 2: Tutorials on Data Management

Topics

• Review of typical data analyses
• Reproducibility & provenance
• Workflows in general
• Computer-based scientific workflows (SWF)
• Benefits of SWF
• Examples of SWF and associated tools

Page 3: Tutorials on Data Management

Learning Objectives

• After completing this lesson, the participant will be able to:
  o Understand a subset of typical analyses used
  o Define a workflow
  o Define a SWF
  o Discuss the benefits of workflows in general and SWF in particular
  o Locate resources for using SWF

Page 4: Tutorials on Data Management

The Data Life Cycle

Stages shown: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze

Page 5: Tutorials on Data Management

Data Analyses

• Conducted via personal computer, grid, cloud computing
• Statistics, model runs, parameter estimations, graphs/plots, etc.

Page 6: Tutorials on Data Management

Types of Analyses

• Processing: subsetting, merging, manipulating
  ◦ Reduction: important for high-resolution datasets
  ◦ Transformation: unit conversions, linear and nonlinear algorithms

Raw records:

0711070500276000
0711070600276000
0711070700277003
0711070800282017
0711070900285000
0711071000293000
0711071100301000
0711071200304000

Transformed data:

Date       Time   Air temp (C)  Precip (mm)
11-Jul-07  5:00   27.6          000
11-Jul-07  6:00   27.6          000
11-Jul-07  7:00   27.7          003
11-Jul-07  8:00   28.2          017
11-Jul-07  9:00   28.5          000
11-Jul-07  10:00  29.3          000
11-Jul-07  11:00  30.1          000
11-Jul-07  12:00  30.4          000

Recreated from Michener & Brunt (2000)
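The transformation on this slide can be sketched in a few lines of Python. The field layout used here (MMDDYY, HHMM, air temperature in tenths of a degree C, precipitation in mm) is an assumption inferred by comparing the raw records with the table; the slide does not document it. The names `Observation` and `parse_record` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Observation(NamedTuple):
    timestamp: datetime
    air_temp_c: float
    precip_mm: int


def parse_record(record: str) -> Observation:
    """Split one 16-digit raw record into typed fields.

    Assumed layout (inferred from the slide's table, not stated there):
    MMDDYY (6) + HHMM (4) + air temp in tenths of a degree C (3) + precip in mm (3).
    """
    if len(record) != 16 or not record.isdigit():
        raise ValueError(f"expected 16 digits, got {record!r}")
    timestamp = datetime.strptime(record[:10], "%m%d%y%H%M")
    air_temp_c = int(record[10:13]) / 10  # unit conversion: tenths of a degree to degrees
    precip_mm = int(record[13:16])
    return Observation(timestamp, air_temp_c, precip_mm)


raw_records = [
    "0711070500276000", "0711070600276000", "0711070700277003", "0711070800282017",
    "0711070900285000", "0711071000293000", "0711071100301000", "0711071200304000",
]
observations = [parse_record(r) for r in raw_records]
for obs in observations:
    print(obs.timestamp.isoformat(), obs.air_temp_c, obs.precip_mm)
```

Keeping the layout in one documented function means the transformation rule itself becomes part of the process metadata.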

Page 7: Tutorials on Data Management

Types of Analyses

• Graphical analyses
  o Visual exploration of data: search for patterns
  o Quality assurance: outlier detection

Figures: box and whisker plot of temperature by month; scatter plot of August temperatures (Strasser, unpub. data)
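Outlier detection of the kind a box and whisker plot supports can also be done numerically. The sketch below applies the common 1.5 × IQR whisker rule; the slide does not say which rule the plots used, so the rule, the function name `iqr_outliers`, and the sample temperatures are all assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical hourly temperatures (degrees C); 45.0 is a deliberately implausible reading.
temps = [27.6, 27.6, 27.7, 28.2, 28.5, 29.3, 30.1, 30.4, 45.0]
print(iqr_outliers(temps))
```

Flagged values still need a human decision: an outlier may be a sensor fault or a real event.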

Page 8: Tutorials on Data Management

Types of Analyses

• Statistical analyses
  Conventional statistics
    o Experimental data
    o Examples: ANOVA, MANOVA, linear and nonlinear regression
    o Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
  Descriptive statistics
    o Observational or descriptive data
    o Examples: diversity indices, cluster analysis, quadrat variance, distance methods, principal component analysis, correspondence analysis

Figure: example of principal component analysis. From Oksanen (2011), Multivariate Analysis of Ecological Communities in R: vegan tutorial
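As one concrete instance of a descriptive statistic, the sketch below computes the Shannon diversity index, H' = -Σ p_i ln p_i. The slide lists "diversity indices" without naming one, so the choice of Shannon's index, the function name, and the abundance counts are assumptions.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over species with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical abundance counts per species at two sites.
even_site = [25, 25, 25, 25]
uneven_site = [97, 1, 1, 1]
print(round(shannon_index(even_site), 3))
print(round(shannon_index(uneven_site), 3))
```

The evenly distributed site scores higher, which is the behavior a diversity index is meant to capture.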

Page 9: Tutorials on Data Management

Types of Analyses

• Statistical analyses (continued)
  o Temporal analyses: time series
  o Spatial analyses: for spatial autocorrelation
  o Nonparametric approaches useful when conventional assumptions violated or underlying distribution unknown
  o Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.
• Analyses of very large datasets
  o Data mining & discovery
  o Online data processing

Page 10: Tutorials on Data Management

After Data Analysis

• Re-analysis of outputs
• Final visualizations: charts, graphs, simulations, etc.

Science is iterative: the process that results in the final product can be complex

Page 11: Tutorials on Data Management

Reproducibility

• Reproducibility at core of scientific method
• Complex process = more difficult to reproduce
• Good documentation required for reproducibility
  o Metadata: data about data
  o Process metadata: data about process used to create, manipulate, and analyze data

CC image by Felix63 on Flickr

Page 12: Tutorials on Data Management

Process Metadata

• Information about process used to get to data outputs
• Related concept: data provenance
  o Origins of data
  o Good provenance = able to follow data throughout entire life cycle
  o Allows for
    • Replication & reproducibility
    • Analysis for potential defects, errors in logic, statistical errors
    • Evaluation of hypotheses
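A minimal sketch of what recording process metadata could look like in a script: each step logs its name, parameters, and short fingerprints of its input and output so the data can be followed from step to step. The helper names (`fingerprint`, `run_with_provenance`, `drop_above`) and the log fields are hypothetical, not a standard provenance format.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj) -> str:
    """Short SHA-256 digest of a JSON-serializable object, used to identify a data version."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(name, func, data, log, **params):
    """Apply func(data, **params) and append a process-metadata entry to log."""
    result = func(data, **params)
    log.append({
        "step": name,
        "parameters": params,
        "input_id": fingerprint(data),
        "output_id": fingerprint(result),
        "run_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


def drop_above(values, limit):
    """Example cleaning rule: discard readings above a plausibility limit."""
    return [v for v in values if v <= limit]


log = []
temps = [27.6, 27.7, 45.0, 28.2]  # hypothetical readings
clean = run_with_provenance("quality_control", drop_above, temps, log, limit=40.0)
print(json.dumps(log, indent=2))
```

Because the output fingerprint of one step matches the input fingerprint of the next, a reader of the log can trace any result back to its origins.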

Page 13: Tutorials on Data Management

Workflows in General

• Formalization of process metadata
• Precise description of scientific procedure
• Conceptualized series of data ingestion, transformation, and analytical steps
• Three components
  o Inputs: information or material required
  o Outputs: information or material produced & potentially used as input in other steps
  o Transformation rules/algorithms (e.g. analyses)
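The three components can be made explicit in code. In this sketch each step names its inputs, its output, and its transformation rule, and the output of one step becomes available as input to later steps. The `Step` class and `run_workflow` function are hypothetical, and each step is simplified to a single output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    inputs: List[str]   # names of the values the step requires
    output: str         # name of the value the step produces
    rule: Callable      # transformation rule applied to the inputs


def run_workflow(steps, store):
    """Run steps in order; each output is stored by name so later steps can use it as input."""
    for step in steps:
        args = [store[key] for key in step.inputs]
        store[step.output] = step.rule(*args)
    return store


steps = [
    Step("convert", ["temp_f"], "temp_c", lambda f: [(x - 32) * 5 / 9 for x in f]),
    Step("average", ["temp_c"], "mean_c", lambda c: sum(c) / len(c)),
]
result = run_workflow(steps, {"temp_f": [32.0, 212.0]})
print(result["mean_c"])
```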

Page 14: Tutorials on Data Management

Workflows in General

• Simplest form of workflow: flow chart

Flow chart: Data import into R → Quality control & data cleaning → Analysis: mean, SD → Graph production

Page 15: Tutorials on Data Management

Workflows in General

• Simplest form of workflow: flow chart

Flow chart, boxes labeled as transformation rules: Data import into R → Quality control & data cleaning → Analysis: mean, SD → Graph production

Page 16: Tutorials on Data Management

Workflows in General

• Simplest form of workflow: flow chart

Flow chart with inputs & outputs: Temperature data, Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics
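The flow chart maps directly onto a short script. The slides use R for this example; the sketch below uses Python instead, with one function per box so that each input and output is visible. The sample data, the `-999` missing-value code, and the function names are assumptions made for illustration.

```python
import csv
import io
import statistics

# Hypothetical raw file contents; a real workflow would read files from disk.
RAW_CSV = """temperature,salinity
27.6,35.1
27.7,35.0
-999,35.2
28.2,
28.5,34.9
"""


def import_data(text):
    """Input: raw text. Output: list of dict rows (the analogue of 'data in R format')."""
    return list(csv.DictReader(io.StringIO(text)))


def quality_control(rows, missing="-999"):
    """Input: imported rows. Output: 'clean' numeric rows, with incomplete rows dropped."""
    clean = []
    for row in rows:
        values = (row["temperature"], row["salinity"])
        if any(v in ("", missing) for v in values):
            continue
        clean.append({"temperature": float(values[0]), "salinity": float(values[1])})
    return clean


def summarize(rows):
    """Input: clean rows. Output: mean and sample SD for each variable."""
    summary = {}
    for name in ("temperature", "salinity"):
        column = [row[name] for row in rows]
        summary[name] = {"mean": statistics.mean(column), "sd": statistics.stdev(column)}
    return summary


imported = import_data(RAW_CSV)
clean = quality_control(imported)
summary = summarize(clean)
print(summary)
```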

Page 17: Tutorials on Data Management

Workflows in General

• Science is becoming more computationally intensive
• Sharing workflows benefits science
  o Scientific workflow systems make documenting workflows easier
• Simplest workflows: scripted languages

Page 18: Tutorials on Data Management

Scientific Workflows (SWF)

• Analytical pipeline
• Each step can be implemented in different software systems
• Each step & its parameters/requirements formally recorded
• Allows reuse of both individual steps and overall workflow

Page 19: Tutorials on Data Management

Benefits of SWF

• Single access point for multiple analyses across software packages
• Keeps track of analysis and provenance: enables reproducibility
  o Each step & its parameters/requirements formally recorded
• Workflow can be stored
• Allows sharing and reuse of individual steps or overall workflow
  o Automate repetitive tasks
  o Use across different disciplines and groups
  o Can run analyses more quickly since not starting from scratch
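Storing and sharing a workflow becomes straightforward once the workflow is described as data rather than buried in code. The sketch below is a toy illustration of that idea and not how Kepler or any other SWF system stores workflows; the registry, step names, and JSON layout are all hypothetical.

```python
import json

# Registry of reusable steps; names and functions are illustrative only.
STEP_REGISTRY = {
    "scale": lambda values, factor: [v * factor for v in values],
    "mean": lambda values: sum(values) / len(values),
}

# The workflow description is plain data: step names plus their recorded parameters.
workflow = [
    {"step": "scale", "params": {"factor": 0.1}},
    {"step": "mean", "params": {}},
]


def run(workflow, data):
    """Apply each described step in order, passing its recorded parameters."""
    for spec in workflow:
        data = STEP_REGISTRY[spec["step"]](data, **spec["params"])
    return data


stored = json.dumps(workflow)    # what would be written to a file or repository
restored = json.loads(stored)    # what a collaborator would load and rerun
print(run(restored, [276, 278]))
```

Because the stored description carries every step and parameter, a collaborator reruns the same analysis instead of rebuilding it from scratch.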

Page 20: Tutorials on Data Management

Example of SWF: Kepler

• Open-source, free, cross-platform
• Drag-and-drop interface for workflow construction
• Steps (analyses, manipulations, etc.) in workflow represented by “actors”
• Actors connect to form a workflow
• Possible applications
  o Theoretical models or observational analyses
  o Hierarchical modeling
  o Can have nested workflows
  o Can access data from web-based sources (e.g. databases)
• Downloads and more information at kepler-project.org

Page 21: Tutorials on Data Management

Example of SWF: Kepler

Screenshot callouts: “Drag & drop components from this list”; “Actors in workflow”

Page 22: Tutorials on Data Management

Example of SWF: Kepler

This model shows the solution to the classic Lotka-Volterra predator-prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated, showing both population change and a phase diagram of the dynamics.
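For readers without Kepler installed, the same two coupled equations, d(prey)/dt = αx − βxy and d(predator)/dt = δxy − γy, can be integrated in a few lines of Python. This sketch uses a classical fourth-order Runge-Kutta step, which is a substitute for Kepler's Continuous Time solver rather than a reproduction of it; the parameter values and initial populations are hypothetical.

```python
import math


def lotka_volterra(state, alpha, beta, delta, gamma):
    """Right-hand side of the two coupled equations: d(prey)/dt and d(predator)/dt."""
    prey, predator = state
    return (
        alpha * prey - beta * prey * predator,
        delta * prey * predator - gamma * predator,
    )


def rk4_step(f, state, dt, *params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, *params)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)), *params)
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)), *params)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), *params)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def invariant(state, alpha, beta, delta, gamma):
    """Quantity conserved by the exact solution; drift in it indicates numerical error."""
    prey, predator = state
    return delta * prey - gamma * math.log(prey) + beta * predator - alpha * math.log(predator)


# Hypothetical parameter values and initial populations, chosen only for illustration.
params = (1.0, 0.1, 0.075, 1.5)  # alpha, beta, delta, gamma
state = (10.0, 5.0)              # prey, predator
dt, steps = 0.01, 2000

trajectory = [state]
for _ in range(steps):
    state = rk4_step(lotka_volterra, state, dt, *params)
    trajectory.append(state)

print(trajectory[-1])
print(invariant(trajectory[0], *params), invariant(trajectory[-1], *params))
```

Plotting the two columns of `trajectory` against time gives the population curves, and plotting them against each other gives the phase diagram described above.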

Page 23: Tutorials on Data Management

Example of SWF: Kepler

Resulting output

Page 24: Tutorials on Data Management

Other SWF Tools: VisTrails

• Open-source
• Workflow & provenance management support
• Geared toward exploratory computational tasks
  o Can manage evolving SWF
  o Maintains detailed history about steps & data
• www.vistrails.org

Screenshot example

Page 25: Tutorials on Data Management

Other SWF Tools: myExperiment

• Social networking site to support scientists that use SWF
• Allows searching for, sharing, reuse of SWF
• Can comment on and discuss contributed SWF
• Gateway to journals and data repositories
• www.myexperiment.org

Page 26: Tutorials on Data Management

Best Practices for Data Analysis

• Scientists should document workflows used to create results
  o Data provenance
  o Analyses and parameters used
  o Connections between analyses via inputs and outputs
• Documentation can be informal (e.g. flowchart) or formal (e.g. Kepler)

Page 27: Tutorials on Data Management

Summary

• Modern science is computer-intensive
  o Heterogeneous data, analyses, software
• Reproducibility is important
• Workflows = process metadata
  o Necessary for reproducibility, repeatability, validation
• SWF offer formal systems for documenting process metadata
  o Storage, sharing, visualization, reuse

Page 28: Tutorials on Data Management

Resources for Data Analysis & SWF

• Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24–32 (2007).
• K. Michener, J. Beach, M. Jones, B. Ludäscher, D. Pennington et al. A knowledge environment for the biodiversity and ecological sciences. J. Intel. Info. Sys. 29, 111–126 (2007).
• B. Ludäscher, I. Altintas, S. Bowers, J. Cummings, T. Critchlow et al. Scientific Process Automation and Workflow Management. Comp. Sci. Ser. Ch. 13 (Chapman and Hall, Boca Raton, 2009).
• T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541–551 (2009).
• B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank et al. Scientific workflow management and the Kepler system. Conc. Comp. Prac. Exper. 18 (2006).
• W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).
