Automation of (Biological) Data Analysis and Report Generation

May 10, 2015

Dmitry Grapov

I've been experimenting with automating simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. Here you can see some of my progress and the challenges encountered.
Transcript
Page 1: Automation of (Biological) Data Analysis and Report Generation

Automation of Biological Data Analysis and Report Generation

Dmitry Grapov, PhD

Page 2: Automation of (Biological) Data Analysis and Report Generation

Bots write the darndest things

http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK

• fill in the template (easy)

• human-guided automation (e.g. MetaboAnalyst, intermediate)

• intelligent/reactive writing (e.g. ~AI, advanced)
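The "fill in the template" tier can be illustrated with a minimal sketch. The template wording, the field names, and the `fill_report` helper below are all hypothetical; they only show the general idea of substituting computed values into fixed text.

```python
from string import Template

# Hypothetical report template; wording and field names are illustrative only.
REPORT_TEMPLATE = Template(
    "A total of $n_samples samples and $n_variables variables were analyzed. "
    "$n_significant variables differed between groups (adjusted p < $alpha)."
)

def fill_report(summary):
    """Substitute summary values into the template.

    Raises KeyError if a field required by the template is missing.
    """
    return REPORT_TEMPLATE.substitute(summary)

# Made-up summary values standing in for real analysis output.
summary = {"n_samples": 24, "n_variables": 310, "n_significant": 17, "alpha": 0.05}
report_text = fill_report(summary)
print(report_text)
```

Using `substitute` rather than `safe_substitute` makes a missing value fail loudly, which is probably preferable to a report that silently contains an unfilled placeholder.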

http://narrativescience.com/

Page 3: Automation of (Biological) Data Analysis and Report Generation

Humans + Bots

Interaction:

• Bots and humans combine in guided analyses

• Humans: make choices (based on bot guides)

• Bots: automate!

Facilitate:

• workflow logging and template creation

• reproducible results

Bot: Initial data and meta data parsing and quality validation

(need: template input)

Human: data cleaning and experimental design identification

(use: multiple choice, dynamic GUI)

Bot: instantiation of complex workflows

Human: overview of bot assumptions and results

Bot: Numerical and text output generation
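The first bot step (parsing and quality validation of data and meta data) might look roughly like the sketch below. The input layout (a dict of measurements per sample plus a list of meta data rows), the `sample_id` field name, and the specific checks are assumptions made for illustration, not the format used in the talk.

```python
def validate_input(data, metadata, id_field="sample_id"):
    """Return a list of human-readable problems; an empty list means all checks passed.

    data:     {sample_id: {variable_name: value or None}}  (assumed layout)
    metadata: [{id_field: sample_id, ...}, ...]            (assumed layout)
    """
    problems = []
    data_ids = set(data)
    meta_ids = {row[id_field] for row in metadata}

    # Samples present in one input but not the other
    for sid in sorted(data_ids - meta_ids):
        problems.append(f"{sid}: measurements without metadata")
    for sid in sorted(meta_ids - data_ids):
        problems.append(f"{sid}: metadata without measurements")

    # Count missing values (encoded here as None) per sample
    for sid in sorted(data_ids):
        n_missing = sum(v is None for v in data[sid].values())
        if n_missing:
            problems.append(f"{sid}: {n_missing} missing value(s)")
    return problems
```

Returning a list of problems rather than raising on the first one lets the human see every issue at once before choosing how to clean the data.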

Page 4: Automation of (Biological) Data Analysis and Report Generation

Humans + Bots write darndender things?

Choose Your Own Life Adventure!

?

https://github.com/dgrapov/AdventureR

Page 5: Automation of (Biological) Data Analysis and Report Generation

Data Analysis Tasks

Visualization (how does it look?)

• histograms, density plots, box plots, line plots, scatter plots, networks, etc.

Statistical Analysis (what is statistically significant?)

• summary tables, ANOVA, FDR adjustment, power analysis, etc.

Exploration (what are the major patterns/trends?)

• clustering, PCA, ICA, etc.

Predictive Modeling (what explains my hypothesis?)

• mixed effects, partial least squares (O-/PLS/-DA), etc.

Network Analysis and Mapping (how are things related?)

• Functional analysis: pathway enrichment or overrepresentation

• Networks: biochemical, structural, mass spectral and empirical networks

• Mapping: projection of analysis results onto network
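As one concrete instance of the statistical tasks listed above, FDR adjustment is commonly done with the Benjamini-Hochberg procedure. The sketch below is a plain-Python version for illustration; the slides do not say which FDR method or implementation was used, so the choice of Benjamini-Hochberg is an assumption.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `bh_adjust([0.01, 0.04, 0.03, 0.005])` gives values close to `[0.02, 0.04, 0.04, 0.02]` (up to floating-point rounding).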

Page 6: Automation of (Biological) Data Analysis and Report Generation

WCMC Data Analysis Reports ™

Statistical analysis, Clustering, PCA, O-PLS-DA, Biochemical enrichment, Network mapping

Input template: BinBase

• inference of experimental goals from sample meta data

• mapping variables to external databases

Tasks:

Report:

Tools:

Page 7: Automation of (Biological) Data Analysis and Report Generation

Automation Challenges

Data cleaning and quality validation

• use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.

Identification of experimental goals

• use: meta data, identify: main and accessory effects; choose: statistics, multivariate tests and visualizations
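A very rough sketch of inferring experimental goals from sample meta data follows. The heuristics (treat a column as a candidate factor if its values repeat across samples, then pick a test from the number of levels) and both function names are assumptions for illustration; a real analysis would need a human to confirm the design.

```python
def candidate_factors(metadata):
    """Columns with more than one level but fewer levels than samples.

    Constant columns and unique identifiers are excluded.
    """
    n = len(metadata)
    columns = metadata[0].keys()
    return [c for c in columns if 1 < len({row[c] for row in metadata}) < n]

def suggest_test(metadata, factor):
    """Suggest a univariate test from the number of distinct levels of one factor."""
    levels = {row[factor] for row in metadata}
    if len(levels) < 2:
        return "none (factor has a single level)"
    if len(levels) == 2:
        return "t-test"
    return "one-way ANOVA"

# Made-up meta data for six samples.
metadata = [
    {"sample_id": f"s{i}", "treatment": trt, "time": t, "batch": "A"}
    for i, (trt, t) in enumerate(
        [("control", 0), ("control", 1), ("control", 2),
         ("drug", 0), ("drug", 1), ("drug", 2)]
    )
]
for factor in candidate_factors(metadata):
    print(factor, "->", suggest_test(metadata, factor))
```

This is the kind of choice where a multiple-choice prompt to the human (for example "is `time` a main effect or a covariate?") is likely more reliable than any automatic rule.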

Integration of multiple tasks to evolve robust analyses

• tasks: statistics, multivariate, functional, networks, database mapping, etc.

Data analysis report generation

• use: R, LaTeX, Markdown

?

Page 8: Automation of (Biological) Data Analysis and Report Generation

Challenges to automated metabolite ID mapping

Stereochemistry?

Search: catechin

Best Match: Catechin

Biologically relevant:

D-catechin

Synonyms?

Search: UDP GlcNAc

FAIL: UDP GlcNac

PASS: UDP-GlcNac
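One way to make the synonym search above tolerant of spacing and punctuation differences is to normalize names before lookup. This is a minimal sketch under stated assumptions: `normalize_name`, `SYNONYM_INDEX`, and the placeholder ID are hypothetical and do not correspond to any real database or service.

```python
import re

def normalize_name(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Hypothetical lookup table keyed by normalized synonym.
# "EXAMPLE_ID_1" is a placeholder, not a real database identifier.
SYNONYM_INDEX = {normalize_name("UDP-GlcNAc"): "EXAMPLE_ID_1"}

def lookup(name):
    """Return the ID for a synonym, or None if the normalized name is unknown."""
    return SYNONYM_INDEX.get(normalize_name(name))

print(lookup("UDP GlcNAc"))  # same normalized key as "UDP-GlcNAc"
```

Note that this does nothing for the stereochemistry problem and can make it worse: "(+)-catechin" and "(-)-catechin" both normalize to "catechin", so stereo descriptors would have to be handled separately.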

Page 9: Automation of (Biological) Data Analysis and Report Generation

Strategies for automated metabolite ID mapping (from synonym)

#1: CTS+

• Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)

• Use KEGGREST and PUG to filter and choose most appropriate IDs

• Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)

#2: Web query

• Use KEGGREST + PubChem PUG to translate synonyms to IDs

• For KEGG ID: synonym → SID → KID

#3: Curated DB

• Generate a curated DB for KEGG and CID translations +

• Include InChI Keys

• Map to other DBs

• Allow fuzzy matching on synonyms

• e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
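The fuzzy matching step can be sketched with a small edit-distance function. The code below implements the optimal string alignment variant of Damerau–Levenshtein distance (adjacent transpositions count as one edit, but no substring is edited twice); the `best_match` helper and its `max_distance` threshold are hypothetical choices for illustration, not the approach used in the talk.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

def best_match(query, candidates, max_distance=2):
    """Closest candidate (case-insensitive), or None if nothing is near enough."""
    if not candidates:
        return None
    q = query.lower()
    dist, name = min((osa_distance(q, c.lower()), c) for c in candidates)
    return name if dist <= max_distance else None
```

For example, `best_match("catehcin", ["Catechin", "Caffeine"])` returns `"Catechin"`, since the swapped "hc" is a single transposition. A fixed threshold is crude for short names, where two edits can turn one metabolite into another, so scaling the threshold by name length may be worth considering.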

Page 10: Automation of (Biological) Data Analysis and Report Generation

Interactive Analysis and Report Generation

knitr (http://yihui.name/knitr/)

Analysis Report Generation

• Analysis on rails or open sandbox

• Humans facilitate robust results generation + Bots ensure reproduction

• Generation of Methods and Results should be automatable
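If each analysis step is logged with its parameters, a Methods paragraph can in principle be rendered from that log. The sketch below assumes a hypothetical log format (a list of dicts with `step` and `params`) and hypothetical sentence templates; neither is taken from Devium or knitr.

```python
# Hypothetical sentence templates keyed by step name; wording is illustrative only.
METHOD_SENTENCES = {
    "log_transform": "Data were log{base} transformed.",
    "fdr": "P-values were adjusted for multiple testing using the {method} procedure.",
    "pca": "Principal components analysis was carried out on {scaling}-scaled data.",
}

def methods_from_log(log):
    """Render a Methods paragraph from a list of logged workflow steps."""
    sentences = []
    for entry in log:
        template = METHOD_SENTENCES.get(entry["step"])
        if template is None:
            # Flag unknown steps visibly instead of silently omitting them
            sentences.append(f"[No template for step '{entry['step']}']")
        else:
            sentences.append(template.format(**entry.get("params", {})))
    return " ".join(sentences)

# Made-up workflow log.
workflow_log = [
    {"step": "log_transform", "params": {"base": 2}},
    {"step": "fdr", "params": {"method": "Benjamini-Hochberg"}},
]
methods_text = methods_from_log(workflow_log)
print(methods_text)
```

Because the text is generated from the same log that drives the analysis, the written Methods cannot drift away from what was actually run, which is the reproducibility benefit the slide points at.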

Page 11: Automation of (Biological) Data Analysis and Report Generation

Devium 2.0

Human-guided automated data analysis and report generator

Human-guided automation could help ensure robust results by making choices which are otherwise difficult to automate.

https://github.com/dgrapov/DeviumWeb

Page 12: Automation of (Biological) Data Analysis and Report Generation

MetaMapR

Linking data analysis and biology

https://github.com/dgrapov/MetaMapR

Integration of complex workflows is key to automation.

Page 13: Automation of (Biological) Data Analysis and Report Generation

Future Goals

+ Workflows for complex experiments (e.g. time-course)

+ Biochemical functional analysis (pathway enrichment)

+ GUI for report generation (Devium 2.0)

+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)

+ Scientific literature mining (RapportR)

+ Interactive plots and networks (JavaScript)

Page 14: Automation of (Biological) Data Analysis and Report Generation

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154