Tools of the Data Smithe’s Trade
Joe Smithe, Tim Hunter, Tad Slawecki, Steve Ruberg
Before we begin, a thank you:
Drew Gronewold, Tim Hunter, Steve Ruberg, Ron Muzzi, more…
Special thanks to the IJC for the invite
Recommendations to the IJC
● The end point: storage, access, analysis, presentation
○ Products of sensor technology infrastructure
○ Data from sensors to users, decision makers, etc.
● Some old tech is fine
● Some new tech is begging to be adopted
● Do what is socially sustainable and secure
○ Account for the retiring generations and the up-and-coming working ones
○ Adopt technologies with support from many people
■ Fair chance of hackers, greater chance of good programmers who can fix things fast
Labyrinths of data, hard to get around...
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://s382.photobucket.com/user/Gandalf-lotr/media/Gandalfsfirework.jpg.html
http://corecanvas.s3.amazonaws.com/theonering-0188db0e/gallery/original/pippinmerry011128a.jpg
http://iihtofficialblog.blogspot.com/2014/07/5-vs-of-hadoop-big-data.html
Overview of Infrastructure Technology
DISCLAIMER: I HAVE NOT WORKED WITH ALL OF THESE TECHNOLOGIES. THIS IS MERELY A
CATALOG OF TOOLS TO DISCUSS.
Overview
1. Target Platforms
2. Data Storage Formats
3. Data Management
4. Model Coupling/Combining
5. Probabilistic Modelling
6. Distributed processing
7. Modelling Services, Processing, Presentation
8. Visualization and interaction
Target Platforms
Desktop
● MS Windows: .NET, OneCore
● Apple: OS X and Xcode
● *nix: various (Linux)
Mobile or Tablet
● Win Phone
● Apple: iOS
● Android
Web
● Microsoft: ASP.NET
● Linux: LAMP
● Other: WordPress, Drupal, many more
Storage formats
● Plain text
○ “Future proof”
○ Growth can prove challenging
○ Examples: XML, WaterML, [other]ML, CSV
● Binary
○ Computers eat this stuff up, but humans don’t. It is good to have transformers that create downloadable and ingestible copies
○ Examples: GRIB, NetCDF
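A "transformer" from binary to plain text can be sketched as follows. This is a minimal illustration using only the Python standard library; the record layout, the station IDs, and the function names `pack_records` and `binary_to_csv` are assumptions made for this example, not part of GRIB, NetCDF, or any real format.

```python
import csv
import io
import struct

# Hypothetical record layout (an assumption for this sketch):
# station id (unsigned 32-bit int) + water level in metres (64-bit float),
# little-endian, no padding.
RECORD = struct.Struct("<Id")


def pack_records(rows):
    """Encode (station_id, level_m) pairs as compact binary."""
    return b"".join(RECORD.pack(station_id, level) for station_id, level in rows)


def binary_to_csv(blob):
    """Transform the binary blob into a human-readable CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["station_id", "level_m"])
    for station_id, level in RECORD.iter_unpack(blob):
        writer.writerow([station_id, level])
    return out.getvalue()


# Made-up station ids and levels, for illustration only.
blob = pack_records([(1001, 176.5), (1002, 174.25)])
csv_text = binary_to_csv(blob)
print(csv_text)
```

The binary copy stays compact for machines, while the CSV copy is what a person would download and open.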
BluePenguino - Photobucket
http://culturepopped.blogspot.com/2014/12/the-legends-of-pac-man.html
Data management
● Data provenance (origin): copies aren’t great, and version control systems offer limited help. Authoritative sources, and citations to them, mitigate noise and stray copies.
● Structured directories, even on the web
● Relational Database Management Systems (RDBMSs)
○ PostgreSQL (recommended), MySQL, SQLite
■ http://ask.metafilter.com/92162/MySQL-vs-PostgreSQL
○ Big Data: NoSQL, SciDB
○ Geospatial: PostGIS, SpatiaLite, MySQL Spatial
■ CUAHSI HydroServer, THREDDS, MapServer, GeoServer, and deegree implement the above
■ Web services (accessibility)
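One way to keep provenance attached to data in a relational database is to store each observation with a reference to its authoritative source. The sketch below uses SQLite through Python's built-in `sqlite3` module; the table names, columns, and sample values are all hypothetical.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        citation TEXT NOT NULL
    );
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        source_id INTEGER NOT NULL REFERENCES source(id),
        observed_on TEXT NOT NULL,
        level_m REAL NOT NULL
    );
    """
)

# Hypothetical source and observations, for illustration only.
conn.execute(
    "INSERT INTO source (id, name, citation) VALUES (1, ?, ?)",
    ("Example gauge network", "Hypothetical citation, 2016"),
)
conn.executemany(
    "INSERT INTO observation (source_id, observed_on, level_m) VALUES (?, ?, ?)",
    [(1, "2016-01-01", 176.5), (1, "2016-01-02", 176.6)],
)

# Every value comes back together with the citation of its origin.
rows = conn.execute(
    """
    SELECT o.observed_on, o.level_m, s.citation
    FROM observation AS o
    JOIN source AS s ON s.id = o.source_id
    ORDER BY o.observed_on
    """
).fetchall()
print(rows)
```

The same schema idea carries over to PostgreSQL or MySQL; only the connection code would differ.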
Data management - new tech to adopt
● GRAPH DATABASES
○ Fund them
○ Power++
■ Utilizes the power of graphs to explore relationships between data points
■ Understand and investigate many-to-many, one-to-many, and many-to-one relationships with ease
○ http://cyanohub.earth.lsa.umich.edu/
○ For more: http://neo4j.com/developer/graph-db-vs-rdbms/ and http://mashable.com/2012/09/26/graph-databases/
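The idea of exploring relationships as a graph can be illustrated without any database at all. The sketch below is plain Python, not Neo4j or any other graph database product; the sensor and parameter names are hypothetical. It stores a many-to-many "sensor measures parameter" relationship as an undirected graph and walks it with a breadth-first search.

```python
from collections import defaultdict, deque

# Hypothetical many-to-many relationship: which sensor measures which parameter.
edges = [
    ("sensor_1", "chlorophyll"),
    ("sensor_1", "temperature"),
    ("sensor_2", "temperature"),
    ("sensor_2", "phosphorus"),
    ("sensor_3", "phosphorus"),
]

# Build an undirected adjacency structure.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the first shortest path found, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Which sensors measure temperature?
temperature_sensors = sorted(graph["temperature"])
# How is sensor_1 related to sensor_3?
link = shortest_path(graph, "sensor_1", "sensor_3")
print(temperature_sensors)
print(link)
```

In a relational database the second question would need a chain of self-joins; in a graph it is a single traversal, which is the appeal of dedicated graph databases.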
Model coupling or combining
● Java-based Object Modelling System
● OpenMI (Open Modelling Interface, C# and Java)
○ GUIs - OpenMI Configuration Editor, Pipistrelle
A lot of specialized models focus on limited domains, and via coupling, we can attain a modelling domain that spans current problems...
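As a rough illustration of coupling, two toy models can be linked by passing one model's output to the other at every time step. This sketch does not use the OpenMI or Object Modelling System APIs; the classes `RunoffModel` and `LakeModel`, the function `run_coupled`, and all numbers are hypothetical.

```python
class RunoffModel:
    """Toy watershed model: a fixed fraction of rainfall becomes runoff."""

    def __init__(self, runoff_fraction):
        self.runoff_fraction = runoff_fraction

    def step(self, rainfall):
        return rainfall * self.runoff_fraction


class LakeModel:
    """Toy lake model: storage changes by inflow minus a constant outflow."""

    def __init__(self, storage, outflow):
        self.storage = storage
        self.outflow = outflow

    def step(self, inflow):
        self.storage += inflow - self.outflow
        return self.storage


def run_coupled(runoff_model, lake_model, rainfall_series):
    """Advance both models together, feeding runoff into the lake each step."""
    storages = []
    for rainfall in rainfall_series:
        inflow = runoff_model.step(rainfall)
        storages.append(lake_model.step(inflow))
    return storages


storages = run_coupled(
    RunoffModel(runoff_fraction=0.5),
    LakeModel(storage=100.0, outflow=1.0),
    rainfall_series=[4.0, 0.0, 2.0],
)
print(storages)
```

Neither model knows about the other; only the small coupling loop does. Standards such as OpenMI formalize that exchange so that independently developed models can be linked.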
Probabilistic Modelling
● Bayesian hierarchical modelling is becoming a very popular approach in many problems where estimates are many but conclusions are few or divergent
○ JAGS
○ Stan
● Cha, Y. and C.A. Stow. 2014. A Bayesian network incorporating observation error to predict phosphorus and chlorophyll a in Saginaw Bay. Environmental Modelling & Software, 57: 90-100
● Gronewold, A.D., J. Bruxer, D. Durnford, J. Smith, A. Clites, F. Seglenieks, T. Hunter, S. Qian, V. Fortin (Accepted, 2016). Hydrological drivers of record-setting water level rise on Earth’s largest lake system. Water Resources Research.
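The core idea of reconciling many estimates can be sketched with the simplest Bayesian building block: a normal prior combined with several independent estimates whose standard deviations are treated as known. This is far short of a full hierarchical model of the kind one would fit in JAGS or Stan, and it is not the method of the papers cited above; the function name `pool_estimates` and all numbers are hypothetical.

```python
import math


def pool_estimates(estimates, prior_mean, prior_sd):
    """Combine (value, sd) estimates with a normal prior.

    Returns the posterior mean and standard deviation, assuming normal
    errors with known standard deviations (a simplifying assumption).
    """
    precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_mean * precision
    for value, sd in estimates:
        weight = 1.0 / sd ** 2  # more certain estimates count for more
        precision += weight
        weighted_sum += value * weight
    return weighted_sum / precision, math.sqrt(1.0 / precision)


# Two equally uncertain, divergent estimates and a very vague prior.
post_mean, post_sd = pool_estimates(
    [(10.0, 1.0), (20.0, 1.0)], prior_mean=15.0, prior_sd=1e6
)
print(post_mean, post_sd)
```

The pooled estimate sits between the inputs and is more certain than either one alone, which is the behaviour that makes this family of methods attractive when sources disagree.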
Distributed processing
● High Performance Computers (HPCs, formerly supercomputers)
● MapReduce (key/value pairs as input)
○ programming model, similar to the Message Passing Interface (MPI)
○ scalable
○ well-regarded fault tolerance (robust)
■ Apache Hadoop (an implementation)
■ R and Hadoop Integrated Processing Environment (RHIPE)
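The MapReduce programming model itself fits in a few lines. The sketch below runs on a single machine in plain Python and uses no Hadoop or RHIPE API; the station names and levels are hypothetical. It shows the three stages: map records to key/value pairs, group the values by key, and reduce each group to one result.

```python
from collections import defaultdict

# Hypothetical input records: (station, water level in metres).
records = [("A", 1.0), ("B", 5.0), ("A", 3.0), ("B", 2.0)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for station, level in records:
        yield station, level


def shuffle(pairs):
    """Group all values that share a key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values to a single result (here, the maximum)."""
    return {key: max(values) for key, values in grouped.items()}


max_levels = reduce_phase(shuffle(map_phase(records)))
print(max_levels)
```

Because each map call and each per-key reduce is independent, a framework such as Hadoop can spread them across many machines and rerun only the pieces that fail.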
Modelling Services, Processing, Presentation
● MATLAB, R, Python (Anaconda distribution), assisted with shell scripting
○ http://www.talyarkoni.org/blog/2013/11/18/the-homogenization-of-scientific-computing-or-why-python-is-steadily-eating-other-languages-lunch/
● Julia
● Web Development
○ PHP, JavaScript (and packages, more later)
○ Frameworks under Java, Python, and Ruby (Ruby on Rails)
○ .NET Frameworks (Microsoft)
○ Backbone.js, Django
○ Content Management Systems (CMSs) such as Drupal, CKAN
Fireworks (Visualization)
● Often cast as the data themselves...
● JavaScript packages: jqPlot, Flot, Processing (language), Raphaël, D3 (successor to Protovis), Google Charts, and Dygraphs
● Apache Flex
● Mapping: OpenLayers, Google Earth/Maps
● Interfaces: CUAHSI HydroShare, QGIS (like ArcGIS), uDig
● Desktop plotting packages:
○ R: ggplot2, ggvis, rgl, and default packages
○ Python: Matplotlib, Plotly, PyChart...
■ https://wiki.python.org/moin/NumericAndScientific/Plotting
jpTheSmithe.com
Relevant parchments:
All from Environmental Modelling and Software:
● Web technologies for environmental big data (Open Access), Vitolo et al. (2015)
● Web based visualization of large climate data sets, J.R. Alder and S.W. Hostetler (2015)
● A review of open source software solutions for developing water resources web applications, Swain et al. (2015)
And we’ll probably do this again in 5-10 years next year!