Can we get scientists to share data through self-interest? C. Titus Brown, UC Davis [email protected]
Page 1: 2015 balti-and-bioinformatics

Can we get scientists to share data through self-interest?

C. Titus Brown

UC Davis

[email protected]

Page 2: 2015 balti-and-bioinformatics

Thanks, Nick!

This is an attempt to explain why I pitched this:

http://ivory.idyll.org/blog/2014-moore-ddd-talk.html

and talk about what I’d like to do with the money.

Page 3: 2015 balti-and-bioinformatics

The way data commonly gets published

Gather data → Analyze data → Write paper → Publish paper and data

Page 4: 2015 balti-and-bioinformatics

Many failure modes:

Gather data → Analyze data → Write paper → Publish paper and data

X: Lack of expertise; lack of tools; lack of compute; bad experimental design.

Page 5: 2015 balti-and-bioinformatics

Many failure modes:

Gather data → Analyze data → Write paper → Publish paper and data

X: (The usual reasons)

Page 6: 2015 balti-and-bioinformatics

One failure mode in particular:

Gather data → Analyze data → Write paper → Publish paper and data

Other data: X

Page 7: 2015 balti-and-bioinformatics

One failure mode in particular:

Gather data → Analyze data → Write paper → Publish paper and data

Other data: X

Lots of biological data doesn’t make sense, except in the light of other data.

This is especially true in two of the fields I work in, environmental metagenomics and non-model mRNAseq.

Page 8: 2015 balti-and-bioinformatics

(For example: gene annotation by homology)

[Figure: sequences labeled “Anything else”, “Mollusc”, and “Cephalopod”, with the annotation “no similarity”.]
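To make the homology example concrete, here is a minimal sketch in Python. It is a toy stand-in for real homology search: it scores shared k-mers (Jaccard similarity) instead of doing an alignment. The sequences, the names, the `annotate` function, and the 0.3 threshold are all assumptions invented for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query, reference, k=3, threshold=0.3):
    """Transfer the name of the most similar reference sequence.

    Returns None when nothing in the reference set is similar enough,
    which is the "no similarity" case.
    """
    query_kmers = kmers(query, k)
    best_name, best_score = None, 0.0
    for name, seq in reference.items():
        score = jaccard(query_kmers, kmers(seq, k))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical protein fragments, invented for illustration only.
reference = {"mollusc_protein_A": "MKTAYIAKQRQISFVKSHFSRQ"}
close_query = "MKTAYIAKQRQISFVKAHFSRQ"      # one substitution away
unrelated_query = "GGWPLLNDCEEYTTVVHHMMWW"  # shares no 3-mers with the reference

print(annotate(close_query, reference))
print(annotate(unrelated_query, reference))
```

The second query gets no annotation at all. Nothing is wrong with the query itself; the reference set simply holds nothing related to it, and only more data can fix that.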

Page 9: 2015 balti-and-bioinformatics

One failure mode in particular:

Gather data → Analyze data → Write paper → Publish paper and data

Other data: X

Lots of biological data doesn’t make sense, except in the light of other data.

This is especially true in two of the fields I work in, environmental metagenomics and non-model mRNAseq.

Page 10: 2015 balti-and-bioinformatics

Hmm.

[Diagram with two boxes labeled “Data publication” and two boxes labeled “Data analysis”.]

Page 11: 2015 balti-and-bioinformatics

I believe:

There are many interesting and useful data sets immured behind lab walls by lack of:

• Expertise

• Tools

• Compute

• Well-designed experimental setup

• Pre-analysis data publication culture in biology

• Recognition that sometimes hypotheses just get in the way

• Good editorial judgment

Page 13: 2015 balti-and-bioinformatics

(Side note)

The existence of journals that will let you publish virtually anything should have really helped data availability!

Sadly, many of them don’t enforce data publication rules.

Page 14: 2015 balti-and-bioinformatics

Data publications!

The obvious solution: data pubs!

(“Pre-publication data sharing”)

Make your data available so that others can cite it!

GigaScience, Data Science, etc.

…but we don’t yet reward this culturally in biology.

(True story: no one cares, yet.)

I’m actually uncertain myself about how much we should reward data and source code pubs. But we can talk later.

Page 15: 2015 balti-and-bioinformatics

Pre-publication data sharing?

There is no obvious reason to make data available prior to publication of its analysis.

There is no immediate reward for doing so.

Neither is there much systematized reward for doing so.

(Citations and kudos feel good, but are cold comfort.)

Worse, there are good reasons not to do so.

If you make your data available, others can take advantage of it…

…but they don’t have to share their data with you in order to do so.

Page 16: 2015 balti-and-bioinformatics

This bears some similarity to the Prisoners’ Dilemma:

http://www.acting-man.com/?p=34313

“Confession” here is not sharing your data.

Note: I’m not a game theorist (but some of my best friends are).
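The analogy can be sketched with a small payoff table. The numbers below are assumptions chosen only to have the Prisoner’s Dilemma ordering; they are not measured from anything, and `best_response` is an illustrative helper.

```python
SHARE, WITHHOLD = "share", "withhold"

# Hypothetical payoffs (higher is better) for two labs, A and B.
# Each value is (payoff to A, payoff to B), keyed by (A's move, B's move).
PAYOFFS = {
    (SHARE, SHARE): (3, 3),        # both benefit from the other's data
    (SHARE, WITHHOLD): (0, 5),     # B uses A's data and gives nothing back
    (WITHHOLD, SHARE): (5, 0),     # A uses B's data and gives nothing back
    (WITHHOLD, WITHHOLD): (1, 1),  # each lab is stuck with only its own data
}


def best_response(their_move):
    """Return lab A's payoff-maximizing move given lab B's move."""
    return max((SHARE, WITHHOLD), key=lambda mine: PAYOFFS[(mine, their_move)][0])


for their_move in (SHARE, WITHHOLD):
    print(f"Other lab plays {their_move!r}: best response is {best_response(their_move)!r}")
```

With these numbers, withholding is the best response either way, even though both labs would do better if both shared.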

Page 17: 2015 balti-and-bioinformatics

So, how do we get academics to share their data!?

Two successful “systems” (send me more!!)

1. Oceanographic research

2. Biomedical research

Page 18: 2015 balti-and-bioinformatics

1. Research cruises are expensive!

In oceanography, individual researchers cannot afford to set up a cruise.

So, they form scientific consortia.

These consortia have data sharing and preprint sharing agreements.

(I’m told it works pretty well (?))

Page 19: 2015 balti-and-bioinformatics

2. Some data makes more sense when you have more data

Omberg et al., Nature Genetics, 2013.

Sage Bionetworks et al.:

• Organize a consortium to generate data;

• Standardize data generation;

• Share via common platform;

• Store results, provenance, analysis descriptions, and source code;

• Run a leaderboard for a subset of analyses;

• Win!

Page 20: 2015 balti-and-bioinformatics

This “walled garden” model is interesting!

“Compete” on analysis, not on data.

Page 21: 2015 balti-and-bioinformatics

Some notes:

• Sage model requires ~similar data in common format;

• Common analysis platform then becomes immediately useful;

• Data is ~easily re-usable by participants;

• Publication of data becomes straightforward;

• Both models are centralized and coordinated.

Page 22: 2015 balti-and-bioinformatics

The $1.5m question(s):

• Can we “port” this sharing model over to environmental metagenomics, non-model mRNAseq, and maybe even VetMed and agricultural research?

• Can we use this model to drive useful pre-publication data sharing?

• Can we take it from a coordinated and centralized model to a decentralized model?

Page 23: 2015 balti-and-bioinformatics

A slight digression:

Most data analysis models are based on centralizing data and then computing on it there. This has several failure points:

• Political: expect lots of biomedical, environmental data to be restricted geopolitically.

• Computation: in the limit of infinite data…

• Bandwidth: in the limit of infinite data…

• Funding: in the limit of infinite data…
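One alternative to centralizing is to send the query to the data and move only small summaries back. Here is a minimal sketch of that idea, assuming hypothetical `DataServer` objects that each keep their raw reads local. The class, the function names, and the toy reads are illustrative only and do not describe an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class DataServer:
    """Hypothetical server that keeps its raw reads local."""
    name: str
    reads: list = field(default_factory=list)

    def count_matches(self, motif):
        """Run the query where the data lives; return only a small summary."""
        hits = sum(1 for read in self.reads if motif in read)
        return {"server": self.name, "hits": hits, "total": len(self.reads)}


def federated_count(servers, motif):
    """Fan the query out to every server and merge the per-server summaries."""
    summaries = [server.count_matches(motif) for server in servers]
    return {
        "hits": sum(s["hits"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
        "per_server": summaries,
    }


# Toy reads, invented for illustration.
servers = [
    DataServer("public", ["ACGTAC", "TTGACA", "GGGACG"]),
    DataServer("walled-garden", ["ACGACG", "CCCCCC"]),
    DataServer("private", ["TTTTTT"]),
]

result = federated_count(servers, "ACG")
print(result["hits"], "of", result["total"], "reads contain the motif")
```

Only the three-field summaries cross the network, so the raw reads never leave the server that holds them. That speaks to the political and bandwidth concerns, though not to everything on the list.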

Page 24: 2015 balti-and-bioinformatics

Proposal: distributed graph database server

[Architecture diagram. Components: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Graph query layer; Public servers; “Walled garden” server; Private server; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]

Page 25: 2015 balti-and-bioinformatics

Graph queries across public & walled-garden data sets:

See Lee, Alekseyenko, Brown, 2009, SciPy Proceedings: ‘pygr’ project.

[Diagram: nodes “raw sequence”, “assembled sequence”, and “nitrite reductase ppaZ”; edge labels “SIMILAR TO” and “ALSO CONTAINS”.]

Page 26: 2015 balti-and-bioinformatics

Graph queries across public & walled-garden data sets:

“What data sets contain <this gene>?”

“Which reads match to <this gene>, but not in <conserved domain>?”

“Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
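Two of these queries can be sketched over a toy in-memory graph. This is an illustration only, not the proposed system. The edge list, the relation names `CONTAINS` and `SIMILAR_TO`, the data set names, and the nitrogen metadata are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy graph. Each edge is (source, relation, target).
EDGES = [
    ("dataset_1", "CONTAINS", "read_a"),
    ("dataset_1", "CONTAINS", "read_b"),
    ("dataset_2", "CONTAINS", "read_c"),
    ("dataset_3", "CONTAINS", "read_d"),
    ("read_a", "SIMILAR_TO", "nitrite_reductase"),
    ("read_b", "SIMILAR_TO", "nitrite_reductase"),
    ("read_c", "SIMILAR_TO", "nitrite_reductase"),
    ("read_d", "SIMILAR_TO", "some_other_gene"),
]
# Hypothetical metadata: nitrogen exposure per data set.
NITROGEN = {"dataset_1": "high", "dataset_2": "low", "dataset_3": "low"}


def reads_in(dataset):
    """All reads a data set contains."""
    return [t for s, rel, t in EDGES if s == dataset and rel == "CONTAINS"]


def reads_matching(gene):
    """All reads with a SIMILAR_TO edge pointing at the gene."""
    return {s for s, rel, t in EDGES if rel == "SIMILAR_TO" and t == gene}


def datasets_containing(gene):
    """Answer: what data sets contain this gene?"""
    hits = reads_matching(gene)
    return sorted(d for d in NITROGEN if hits & set(reads_in(d)))


def abundance_by_exposure(gene):
    """Fraction of reads matching the gene, pooled by nitrogen exposure."""
    hits = reads_matching(gene)
    matched, total = defaultdict(int), defaultdict(int)
    for dataset, exposure in NITROGEN.items():
        reads = reads_in(dataset)
        total[exposure] += len(reads)
        matched[exposure] += sum(1 for r in reads if r in hits)
    return {e: matched[e] / total[e] for e in total}


print(datasets_containing("nitrite_reductase"))
print(abundance_by_exposure("nitrite_reductase"))
```

A real deployment would need an index rather than a linear scan over edges, but the shape of the questions stays the same: follow edges from genes to reads to data sets, then group by metadata.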

Page 27: 2015 balti-and-bioinformatics

Thesis:

If we can provide immediate returns for data sharing, researchers will do so, and do so immediately.

Not to do so would place them at a competitive disadvantage.

(All the rest is gravy: open analysis system, reproducibility, standardized data format, etc.)

Page 28: 2015 balti-and-bioinformatics

Puzzle pieces.

1. Inexpensive and widely available cloud computing infrastructure?

Yep. See Amazon, Google, Rackspace, etc.

Page 29: 2015 balti-and-bioinformatics

Puzzle pieces.

2. The ability to do many or most sequence analyses inexpensively in the cloud?

Yep. This is one reason for khmer & khmer-protocols.

Page 30: 2015 balti-and-bioinformatics

Puzzle pieces.

3. Locations to persist indexed data sets for use in search & retrieval?

figshare & dryad (?)

Page 31: 2015 balti-and-bioinformatics

Puzzle pieces.

4. Distributed data mining approaches?

Some literature, but I know little about it.

Page 32: 2015 balti-and-bioinformatics

In summary: How will we do this?

I PLAN TO FAIL.

A LOT.

PUBLICLY.

(ht @ethanwhite)

Page 33: 2015 balti-and-bioinformatics

In summary: How will we know if (or when) we’ve “won”?

1. When people use, extend, and remix our software and concepts without talking to us about it first. (cf. khmer!)

2. When the system becomes so useful that people go back and upload old data sets to it.

Page 34: 2015 balti-and-bioinformatics

In summary: The larger vision

Enable and incentivize sharing by providing immediate utility; frictionless sharing.

Permissionless innovation for e.g. new data mining approaches.

Plan for poverty with federated infrastructure built on open & cloud.

Solve people’s current problems, while remaining agile for the future.

Page 35: 2015 balti-and-bioinformatics

Thanks!

References and pointers welcome!

https://github.com/ged-lab/buoy

(Note: there’s nothing there yet.)