Why should you trust my data code4lib 2016

Why Should You Trust My Data? Building data infrastructure that accommodates networks of trust

Matt Zumwalt

datjawn.com | databindery.com

@flyingzumwalt | code{4}lib 2016

http://datjawn.com | http://databindery.com

I'm interested in trust, particularly trust & trustworthiness

when people exchange data

there's a rhythm to the computing world

centralization | decentralization

client-server | peer-to-peer

mainframes

personal computers

server farms

[internet of everything] the cloud

the PC revolution

computers

the diamond age

remember mainframes?

image credit wikipedia

https://en.wikipedia.org/wiki/UNIVAC#/media/File:UnivacII.jpg

the www

host data, reference each other

but data

image credit Torkild Retvedt

https://www.flickr.com/photos/torkildr/3462606643

$$

$$

$$

$

By 2019 the data created by IoE devices alone will be 49 times higher than all the traffic that moved through datacenters in 2014.

it won't scale.

Reference: Cisco Global Cloud Index

http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html

Worldwide Storage Capacity in 2012: 2.5 zettabytes

Total Data Center Traffic in 2016: 10.4 zettabytes per year

Anticipated data created by Internet of Everything (IoE) devices in 2019:

507.5 zettabytes per year

References: NetApp, Cisco Global Cloud Index, Gigaom, Washington Post

http://siliconangle.com/blog/2012/05/21/when-will-the-world-reach-8-zetabytes-of-stored-data-infographic/

http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html

https://gigaom.com/2012/05/30/heres-what-our-web-addiction-looks-like-in-2016/

https://www.washingtonpost.com/blogs/ezra-klein/post/how-big-can-the-internet-get/2012/05/30/gJQAu9OH2U_blog.html
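As a quick sanity check on the figures quoted above, the "49 times" claim lines up with the ratio between the 507.5 ZB and 10.4 ZB numbers. The constant names below are illustrative, not from the talk, and the values are simply the ones shown on the slide.

```python
# Figures as quoted on the slide, in zettabytes (per year where applicable).
IOE_DATA_2019_ZB = 507.5         # anticipated data created by IoE devices in 2019
DATACENTER_TRAFFIC_ZB = 10.4     # total data center traffic figure from the slide
STORAGE_CAPACITY_2012_ZB = 2.5   # worldwide storage capacity in 2012


def ratio(numerator: float, denominator: float) -> float:
    """Return how many times larger the numerator is than the denominator."""
    return numerator / denominator


traffic_ratio = ratio(IOE_DATA_2019_ZB, DATACENTER_TRAFFIC_ZB)
storage_ratio = ratio(IOE_DATA_2019_ZB, STORAGE_CAPACITY_2012_ZB)

print(f"IoE data vs. data center traffic: about {traffic_ratio:.1f}x")
print(f"IoE data vs. 2012 storage capacity: about {storage_ratio:.0f}x")
```

Under these numbers, a single year of IoE data would be roughly two hundred times the storage capacity that existed worldwide in 2012, which is the sense in which centralizing it all "won't scale".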

distributed data web

"You can't propose that something be a universal space and at the same time keep control of it." - Tim Berners-Lee

http://webfoundation.org/about/vision/history-of-the-web/

this relies on trust

elements of trustworthiness

authority & reputation, integrity & provenance, synergy or compatibility, consistency, etc.

we've got this

Organisms have been solving these problems for eons

Humans for millennia

Librarians for centuries

Software developers for decades

git for (tabular) data

transparency & reproducibility

http://datjawn.com builds from the work of http://dat-data.com

Tabular: rows & columns (i.e., spreadsheets, CSV, SQL DBs)


history has branches

initial commit

a set of changes

commit those changes and describe them

Who made the changes? Why did they make them?

When did they commit them?

more changes

commit those changes

different changes committed to a different branch

other changes on another branch

merge two branches

get a specific version, prove it's identical, know who made it
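The commit-and-branch walkthrough above can be sketched in a few lines. This is a minimal illustration of the general idea (content-addressed commits that record who, why, and when), not how dat or git is actually implemented; the `Commit` class, its fields, and the sample rows are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Commit:
    """A hypothetical commit: a snapshot of tabular rows plus its provenance."""
    rows: Tuple[Tuple[str, ...], ...]   # the data at this version
    author: str                         # who made the changes
    message: str                        # why they made them
    timestamp: str                      # when they committed them
    parents: Tuple[str, ...] = ()       # ids of parent commits (two for a merge)

    @property
    def id(self) -> str:
        # The id is a hash of the content and provenance, so any change
        # to either produces a different id.
        payload = json.dumps(
            {
                "rows": self.rows,
                "author": self.author,
                "message": self.message,
                "timestamp": self.timestamp,
                "parents": self.parents,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_identical(copy: Commit, expected_id: str) -> bool:
    """Prove a copy matches a version by recomputing its id."""
    return copy.id == expected_id


header = ("id", "title")
initial = Commit((header, ("1", "Moby Dick")), "alice", "initial commit",
                 "2016-03-08T09:00:00Z")
main = Commit(initial.rows + (("2", "Walden"),), "alice", "add Walden",
              "2016-03-08T10:00:00Z", parents=(initial.id,))
branch = Commit(initial.rows + (("3", "Emma"),), "bob", "add Emma",
                "2016-03-08T10:30:00Z", parents=(initial.id,))
merged = Commit(initial.rows + (("2", "Walden"), ("3", "Emma")), "alice",
                "merge bob's branch", "2016-03-08T11:00:00Z",
                parents=(main.id, branch.id))

tampered = replace(merged, rows=initial.rows + (("2", "Walden"),))
print(is_identical(merged, merged.id), is_identical(tampered, merged.id))
```

Because each id covers the parent ids as well, trusting one commit id is enough to check the whole history behind it.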

Files are data. They have histories.

Metadata are data. They have histories too.

Whatever the data, the same patterns apply.

How does this get replicated?

client-server approach

peer to peer approach
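One way to picture the peer-to-peer approach is two peers that each hold content-addressed blobs and exchange whatever the other is missing, checking every blob against its hash on arrival. This is a rough sketch under that assumption; the `Peer` class and its methods are hypothetical and do not describe dat's actual replication protocol.

```python
import hashlib
from typing import Dict, Set


def content_id(blob: bytes) -> str:
    """Address a blob by the SHA-256 hash of its bytes."""
    return hashlib.sha256(blob).hexdigest()


class Peer:
    """A hypothetical peer holding blobs keyed by their content id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}

    def add(self, blob: bytes) -> str:
        cid = content_id(blob)
        self.store[cid] = blob
        return cid

    def missing_from(self, other: "Peer") -> Set[str]:
        """Ids the other peer has that this peer lacks."""
        return set(other.store) - set(self.store)

    def pull(self, other: "Peer") -> int:
        """Copy missing blobs from the other peer, rejecting any that fail the hash check."""
        fetched = 0
        for cid in self.missing_from(other):
            blob = other.store[cid]
            if content_id(blob) != cid:
                continue  # the blob does not match its claimed id, so skip it
            self.store[cid] = blob
            fetched += 1
        return fetched


def sync(a: Peer, b: Peer) -> None:
    """Neither side is the server: each pulls from the other."""
    a.pull(b)
    b.pull(a)


alice, bob = Peer("alice"), Peer("bob")
alice.add(b"1,Moby Dick")
alice.add(b"2,Walden")
bob.add(b"2,Walden")
bob.add(b"3,Emma")
sync(alice, bob)
print(len(alice.store), len(bob.store))
```

Since a blob is verified by its hash rather than by where it came from, it does not matter which peer supplied it.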

the tide has already shifted

Stop building server-side applications.

Assume that data are anywhere and/or everywhere.

Assume that your software will be run in many places.

Erase your distinctions between server and client.

Let data grow branches - build trees (i.e., Merkle DAGs).

Stop thinking of data as singular.

Stop thinking of datasets as monolithic.

Embrace redundancy & replication.

Understand that trustworthiness and authority are dynamic.

Broaden your sense of now.

Appreciate provenance.
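The Merkle DAGs mentioned above rest on a simple idea: every node is identified by a hash of its children, so one root hash vouches for everything beneath it. The sketch below uses a plain binary Merkle tree, which is the simplest special case of a Merkle DAG, and repeats the last hash on odd-sized levels as a simplification. The function names and sample rows are illustrative, not taken from any particular library.

```python
import hashlib
from typing import List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(chunks: List[bytes]) -> str:
    """Compute a single root hash over a list of data chunks (e.g. table rows)."""
    if not chunks:
        return sha256_hex(b"")
    # Leaves: one hash per chunk.
    level = [sha256_hex(chunk) for chunk in chunks]
    # Repeatedly hash pairs of nodes until one root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: repeat the last hash
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


rows = [b"1,Moby Dick", b"2,Walden", b"3,Emma"]
replica = list(rows)                      # a copy fetched from somewhere else
altered = [b"1,Moby Dick", b"2,Walden", b"3,Emma (abridged)"]

print(merkle_root(rows) == merkle_root(replica))   # same data, same root
print(merkle_root(rows) == merkle_root(altered))   # any change alters the root
```

With a structure like this, redundancy is cheap to trust: two copies of a dataset are the same exactly when their roots match, wherever the copies happen to live.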

there are no servers, there is only the web

Meet the dat jawn team on Wednesday

