Automation of Biological Data Analysis and Report Generation
Dmitry Grapov, PhD
May 10, 2015
Bots write the darndest things
http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK
• fill in the template (easy)
• human-guided automation (e.g. Metaboanalyst, intermediate)
• intelligent/reactive writing (e.g. ~AI, advanced)
http://narrativescience.com/
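The "fill in the template" level can be sketched in a few lines. This is a minimal illustration in Python, not the code behind the linked earthquake story; the template wording and the field names (`magnitude`, `place`) are assumptions made for the example.

```python
from string import Template

# Hypothetical template; the wording and field names are illustrative only.
REPORT_TEMPLATE = Template("A magnitude $magnitude earthquake struck near $place.")


def fill_report(event: dict) -> str:
    """Substitute event fields into the fixed template."""
    # substitute() raises KeyError when a field is missing, so incomplete
    # input is surfaced instead of producing a broken sentence.
    return REPORT_TEMPLATE.substitute(event)


event = {"magnitude": 2.7, "place": "Westwood, California"}
report = fill_report(event)
print(report)  # A magnitude 2.7 earthquake struck near Westwood, California.
```

The bot adds no judgment here: every choice about what is worth reporting was made by whoever wrote the template, which is why this level is the easy one.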
Humans + Bots
Interaction:
• Bots and humans combine in guided analyses
• Humans: make choices (based on bot guides)
• Bots: automate!
Facilitate:
• workflow logging and template creation
• reproducible results
Bot: Initial data and meta data parsing and quality validation
(need: template input)
Human: data cleaning and experimental design identification
(use: multiple choice, dynamic GUI)
Bot: instantiation of complex workflows
Human: overview of bot assumptions and results
Bot: Numerical and text output generation
Humans + Bots write darnder things?
Choose Your Own Life Adventure!
https://github.com/dgrapov/AdventureR
Data Analysis Tasks
Visualization (how does it look?)
• histograms, density plots, box plots, line plots, scatter plots, networks, etc.
Statistical Analysis (what is statistically significant?)
• summary tables, ANOVA, FDR adjustment, power analysis, etc.
Exploration (what are the major patterns/trends?)
• clustering, PCA, ICA, etc.
Predictive Modeling (what explains my hypothesis?)
• mixed effects, partial least squares (O-/PLS/-DA), etc.
Network Analysis and Mapping (how are things related?)
• Functional analysis: pathway enrichment or overrepresentation
• Networks: biochemical, structural, mass spectral and empirical networks
• Mapping: projection of analysis results onto network
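Two of the tasks listed above, pathway overrepresentation and FDR adjustment, combine naturally: one test per pathway, then a correction across all pathways. The sketch below uses a one-sided hypergeometric test and the Benjamini-Hochberg procedure as common choices; the slides do not say which methods the reports use, and the pathway names and counts are made up.

```python
from math import comb


def overrepresentation_pvalue(hits, selected, pathway_size, background):
    """P(X >= hits) when `selected` metabolites are drawn from `background`
    metabolites, of which `pathway_size` belong to the pathway."""
    upper = min(selected, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(background, selected)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up example: 25 significant metabolites out of 200 measured.
# Each entry is (significant metabolites in pathway, pathway size).
pathways = {"Pathway A": (6, 20), "Pathway B": (2, 15), "Pathway C": (1, 30)}
names = list(pathways)
raw = [overrepresentation_pvalue(h, 25, size, 200) for h, size in pathways.values()]
for name, p, q in zip(names, raw, benjamini_hochberg(raw)):
    print(f"{name}: p = {p:.4f}, FDR-adjusted p = {q:.4f}")
```

The background size matters as much as the hit counts: using all metabolites in a database instead of only those measured would change every p-value, so it is one of the choices a human should confirm.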
WCMC Data Analysis Reports ™
Statistical analysis, Clustering, PCA, O-PLS-DA, Biochemical enrichment, Network mapping
Input template: BinBase
• inference of experimental goals from sample meta data
• mapping variables to external databases
Tasks:
Report:
Tools:
Automation Challenges
Data cleaning and quality validation
• use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.
Identification of experimental goals
• use: meta data; identify: main and accessory effects; choose: statistics, multivariate tests and visualizations
Integration of multiple tasks to evolve robust analyses
• tasks: statistics, multivariate, functional, networks, database mapping, etc.
Data analysis report generation
• use: R, LaTeX, markdown
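The quality-validation step above (quality control samples used to identify precision) is one of the easier pieces to automate. A minimal sketch, assuming precision is summarized as the relative standard deviation (RSD) of each variable across repeated QC injections; the 30% cutoff, the variable names and the intensities are illustrative assumptions, not values from the slides.

```python
from statistics import mean, stdev


def qc_rsd(qc_values):
    """Relative standard deviation (%) of one variable across QC injections."""
    return 100.0 * stdev(qc_values) / mean(qc_values)


def flag_imprecise(qc_table, max_rsd=30.0):
    """Return the names of variables whose QC RSD exceeds max_rsd percent."""
    return sorted(
        name for name, values in qc_table.items() if qc_rsd(values) > max_rsd
    )


# Made-up QC intensities for two variables measured in three QC injections.
qc_table = {
    "glucose": [90.0, 100.0, 110.0],
    "unknown_1": [10.0, 50.0, 120.0],
}
flagged = flag_imprecise(qc_table)
print(flagged)  # ['unknown_1']
```

Flagging rather than deleting keeps the human in the loop: the bot reports which variables look imprecise, and the analyst decides whether to drop, normalize or keep them.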
Challenges to automated metabolite ID mapping
Stereochemistry?
Search: catechin
Best Match: Catechin
Biologically relevant:
D-catechin
Synonyms?
Search: UDP GlcNAc
FAIL: UDP GlcNac
PASS: UDP-GlcNac
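The synonym failure above comes down to spacing, punctuation and letter case. One possible mitigation, sketched here as an assumption rather than as the approach used in the talk, is to normalize names before lookup. The function name `normalize_synonym` and the small lookup table are hypothetical.

```python
import re


def normalize_synonym(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Hypothetical lookup table keyed by normalized synonym.
known_names = ["UDP-GlcNac", "Catechin"]
index = {normalize_synonym(n): n for n in known_names}


def lookup(query: str):
    """Return the stored name matching the query after normalization, or None."""
    return index.get(normalize_synonym(query))


print(lookup("UDP GlcNAc"))  # UDP-GlcNac
print(lookup("udp_glcnac"))  # UDP-GlcNac
```

The same normalization makes the stereochemistry problem worse: "(+)-catechin" and "(-)-catechin" both collapse to "catechin", so a match found this way still needs a check against structure-based identifiers.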
Strategies for automated metabolite ID mapping (from synonym)
#1: CTS+
• Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)
• Use KEGGREST and PUG to filter and choose most appropriate IDs
• Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)
#2: Web query
• Use KEGGREST + PubChem PUG to translate synonyms to IDs
• For KEGG ID: synonym → SID → KID
#3: Curated DB
• Generate a curated DB for KEGG and CID translations
• Include InChI Keys
• Map to other DBs
• Allow fuzzy matching on synonyms
• e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
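Fuzzy matching on synonyms can be sketched with an edit distance. The code below implements the optimal string alignment distance, the restricted variant of Damerau-Levenshtein in which each adjacent transposition counts as one edit but no substring is edited twice; it is not the unrestricted algorithm. The `best_match` helper, its `max_distance` default and the candidate names are assumptions for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "hc" <-> "ch".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def best_match(query, candidates, max_distance=2):
    """Return the closest candidate (case-insensitive) or None if too far."""
    scored = [(osa_distance(query.lower(), c.lower()), c) for c in candidates]
    distance, name = min(scored)
    return name if distance <= max_distance else None


print(osa_distance("catechin", "catehcin"))                      # 1
print(best_match("UDP GlcNAc", ["UDP-GlcNAc", "UDP-glucose"]))   # UDP-GlcNAc
```

A distance cutoff is a trade-off: too strict and harmless spelling variants fail, too loose and distinct compounds with similar names are merged, so ambiguous matches are a natural place to ask a human.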
Interactive Analysis and Report Generation
knitr (http://yihui.name/knitr/)
Analysis Report Generation
• Analysis on rails or open sandbox
• Humans facilitate robust results generation + Bots ensure reproduction
• Generation of Methods and Results should be automatable
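One way to read the last point: if every analysis step is logged together with its parameters, a Methods paragraph can be generated from the log. The sketch below is plain Python and does not use knitr; the `WorkflowLog` class, its method names and the recorded steps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowLog:
    """Ordered record of analysis steps and the parameters used."""
    steps: list = field(default_factory=list)

    def record(self, description: str, **params) -> None:
        """Store one step; call this wherever the analysis is actually run."""
        self.steps.append((description, params))

    def methods_text(self) -> str:
        """Render the logged steps as a single Methods paragraph."""
        sentences = []
        for description, params in self.steps:
            if params:
                details = ", ".join(f"{k} = {v}" for k, v in params.items())
                sentences.append(f"{description} ({details}).")
            else:
                sentences.append(f"{description}.")
        return " ".join(sentences)


log = WorkflowLog()
log.record("Data were log-transformed", base=2)
log.record("Principal components analysis was performed", components=2)
methods = log.methods_text()
print(methods)
# Data were log-transformed (base = 2). Principal components analysis was performed (components = 2).
```

Because the text is derived from the same record that drives the analysis, the Methods section cannot drift from what was actually run, and rerunning the log reproduces the results.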
Devium 2.0: Human-guided automated data analysis and report generator
Human-guided automation could help ensure robust results by making choices which are otherwise difficult to automate.
https://github.com/dgrapov/DeviumWeb
MetaMapR: Linking data analysis and biology
https://github.com/dgrapov/MetaMapR
Integration of complex workflows is key to automation.
Future Goals
+ Workflows for complex experiments (e.g. time-course)
+ Biochemical functional analysis (pathway enrichment)
+ GUI for report generation (Devium 2.0)
+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)
+ Scientific literature mining (RapportR)
+ Interactive plots and networks (JavaScript)
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154