Data Science and Scientific Discovery: New Approaches to Nature’s Complexity
Dr. John Rumble, President, R&R Data Services, Gaithersburg MD
www.randrdata.com
Feb 24, 2016
To understand scientific and technical data today, we must first understand how the information revolution has changed both Science and Data and their relationship
2DC Data Science May 2012
My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data
Why Do We Do Science
Two primary motivations for advancing science
• First is our insatiable thirst to understand the world– probably from when we started thinking
• Second is a direct result of the Industrial Revolution: How does the technology we are inventing actually work?
21st Century Science
• From the fundamental to the complex
  – Determining the laws of nature for a few particles to understanding real systems: cells, the atmosphere, the Earth, ecology
• From reductionism to constructionism
  – Using our basic knowledge to make models and predict behavior of real systems, that is, all systems we find in nature or that we can construct
Science Vol. 336, p. 707 (2012)
J. Schmitt et al, Science vol 336, p 708, 2012
Today’s Science and E-Science
The Data Revolution has enabled E-Science through
– Advanced telecommunications and networks
– Computation power and storage
– New algorithms for data management, visualization, analysis, and mathematics
• Today, E-Science can be done faster and more powerfully, and scientific communication can occur almost instantly
The real revolution, however, is in the relationship between science and data
When it hits the New York Times, you know it is for real!
Science and Data
To understand scientific and technical data today, we must first understand how the information revolution has changed Science and Data and their relationship
• Science today is not about reduction to a few basic laws
• Science is about how we understand and control all aspects of nature
• How is this done?
  – By careful measurement, accurate tests, keen observations, and powerful models and simulations that lead to scientific knowledge
• The results are expressed as scientific data!
Scientific Knowledge
• What does this really mean?
• Recognize a new phenomenon
• Analyze its components
• Identify the variables that govern it
• Isolate the important variables
• Demonstrate understanding by control
• Change the phenomenon
Scientific knowledge means understanding the independent variables governing a phenomenon and how they influence it
Science Today
A major theme of science today is that we are able to make accurate measurements on a complex world that
– Advance our understanding of nature,
– Improve our ability to harness technology,
And, in spite of many challenges,
– Increase the importance of science to society in the future
Scientific data are at the core of modern science
My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data
The Data Revolution in Science
Today, E-Science is real
• Computer at every desk
• Connectivity: The Internet/WWW explosion
• Computerized experiments and observations
• Database tools on every computer
• Electronic publications
• Model and simulation-based R&D
• Comprehensive databases
• Virtual libraries
Four Ways to Generate Scientific Data
• Observations
• Experiments
• Standardized testing
• Modeling and simulation
Observational Science Today
Today we have exciting new capability to observe nature in situ better than ever before
– Hubble Space Telescope
– High sensitivity seismographs
– Bio-macromolecule sequencing instruments
– LTER (Long-term ecological research) platforms
– Earth-observing satellites
– High power computers to analyze data
Generates huge amounts of quality data
Experimental Science Today
Today we have exciting new capability to observe nature in controlled circumstances better than ever before
– Atomic force microscopes
– Micro-electronics and lasers
– High energy accelerators
– Femto-second chemical reactors
– High power computers to analyze data
Generates large amounts of high quality data
Testing Today
Today we have new capability to test and analyze materials using standard methods
– Electronic test equipment
– Analytical databases fully integrated into equipment
– Analyzing unknown substances
– Carbon and other techniques for dating objects
– Genomic sequencing
– National and international standard test procedures
– Data analysis tools to generate properties
– Self-calibrating instruments
Generates medium amounts of high quality data
Computation Today
We now also have the ability to create a Virtual World
• Models and simulations of complex systems
• Techniques to do advanced mathematics
• Computers to execute immense calculations
• Visualization tools to examine our virtual world
Uses and generates large amounts of data
Characteristics of Approaches for Generating Scientific Data
Approach      | Type of Science | Independent Variables | Examples                                            | Data Volumes
Observational | Big             | As available, many    | Satellites, seismic, census                         | Large
              | Small           | As available, few     | Biodiversity, social                                | Small
Experiment    | Big             | Chosen, few           | High energy physics                                 | Large
              | Small           | Chosen, few           | Chemistry                                           | Small
Test          | Big             | Specified, few        | Genomics                                            | Medium
              | Small           | Specified, many       | Materials testing, imaging, structure determination | Small to large
Modeling      | Big             | Chosen, many          | Climate change, epidemiology                        | Medium
              | Small           | Chosen, few           | All disciplines                                     | Small
The Data Revolution in Science is Real
• Observation, experimentation, testing, and calculation all produce, and in some cases use, large amounts of data
• E-Science has provided an incredible array of tools, technologies, and methods to collect, store, manage, analyze, exploit, preserve, and disseminate these data
Science today is more fully based on data and data collections than ever before!
My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data
Scientific Data and Scientific Databases
Data communicate measurement (experimental and observational) and computational results
“When you can measure what you are speaking about, and express it in numbers, you know something about it.” Lord Kelvin
Types of Scientific Data
• Numbers: 1, 2, 3…
• Simple text: ABCs
• Complex text: Greek, scripts, symbols
• Equations: E=mc²
• Graphs
• Diagrams
• Pictures
• Software
• Rules
All Data Are Not the Same
• Measurement or property: There is a difference!
• Measurements are a one-time look at nature
• Properties are the inherent characteristics of nature – They are Nature Itself
Measurements are for Today
• Measurements are what you see now
• Capture one point of view
• Usually limited number of variables changed
One of 1300 measurements of Diego Giacometti
Properties are Forever
• Properties are the real thing
• Need many repeated measurements
• Far too many substances and systems to determine properties
• Will never know the properties of everything
The real Diego Giacometti
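The distinction between a single measurement and a property can be made concrete with a small numerical sketch. The example below assumes the simplest possible treatment: repeated readings of one quantity are summarized by their mean and the standard error of that mean. The function name `estimate_property` and the readings are hypothetical and are not taken from the talk.

```python
import math
import statistics

def estimate_property(measurements):
    """Summarize repeated measurements as (mean, standard error of the mean)."""
    if len(measurements) < 2:
        raise ValueError("need at least two measurements to estimate uncertainty")
    mean = statistics.mean(measurements)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(measurements) / math.sqrt(len(measurements))
    return mean, std_err

# Hypothetical repeated readings of one quantity (arbitrary units).
readings = [9.8, 10.1, 10.0, 9.9, 10.2]
mean, std_err = estimate_property(readings)
print(f"estimate = {mean:.2f} +/- {std_err:.2f}")
```

A single reading gives no uncertainty at all, which is why the function refuses fewer than two values; real property evaluation would also have to account for systematic error, which this sketch ignores.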
Scientific Knowledge
Theories Models
Hypotheses Questions
Data
Measurement
The Classical Paradigm for Science and Data
Scientific Knowledge
Theories Models
Hypotheses Questions
Data
Measurement
Data Collections
The true data paradigm has always been this
Scientific Databases in History
• Preserved data collections (large and small)
• At first, simply data preservation
• Data was stored, but not really exploited
1. Accuracy
2. Comprehensiveness
3. Systematizing
Accuracy
Newgrange – Ireland
• 6000 years old
• Aligned to the rising sun at the winter solstice
• Depended on careful observational data on the rising sun
• One data point!
Volume and Accuracy Improving
Stonehenge
• 5000 years old
• Over 100 stones
• Complicated stone alignments
• Marks position of the moon and major stars as well as the sun
• Storage of several observations
Comprehensive Data Sets
Galen
• Greek physician
• Experimental physiologist
• Arabic copy from 800 AD
• Pictorial, descriptive, function describing
• Representative of botanical and animal catalogs
Systematizing a Comprehensive Collection
Pliny the Elder
• Roman scholar
• Natural History (77 AD)
• One of earliest known encyclopedias of the natural world
• Systematization of data
My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data
Data and Scientific Discovery
• The advent of the Baconian Revolution: anchoring scientific understanding to physical observation
• Led to databases becoming the foundation of scientific discovery
• True Beginnings of Data Science!
Scientific Databases in History
• Preserved data collections (large and small) form the foundation of scientific discovery
• Trends in data preservation and discovery
1. Accuracy
2. Comprehensiveness
3. Systematizing
4. Extraction of essence
5. Explanation of the complex
6. Prediction of new phenomena!
7. Physical theory from data!
Extraction of Essence
Tycho Brahe
• Late 16th Century
• Danish Astronomer
• Made precise measurements that led to Kepler’s theories
• Led to discovery of simple relationships
Explanation of the Complex
Charles Darwin
• Combined with others in geology, zoology and botany
• A wide variety of facts and phenomena recorded
• Theory of Evolution had to explain many diverse observations and measurements from different disciplines
Prediction of New Phenomena
Mendeleev and the Chemical Periodic Table
Predicting properties of unknown elements from properties (data) of known elements
Physical Theory from Data
• Notes on the Spectral Lines of Hydrogen: Johann Jacob Balmer, Annalen der Physik und Chemie 25 80-5 (1885)
  – “I gradually arrived at a formula which, at least for these four lines, expresses a law by which their wavelengths can be represented with striking precision… From the formula, we obtained for a fifth hydrogen line 3969.65×10^-7 mm.”
• The development of quantum mechanics
Bohr
Schrödinger
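The slide quotes Balmer's result without the formula itself. As a minimal sketch, assuming the commonly cited form of his relation, wavelength = h·m²/(m² − n²) with n = 2 and his empirical constant h = 3645.6 (in units of 10^-7 mm), the visible hydrogen lines can be computed as follows. The names `BALMER_CONSTANT` and `balmer_wavelength` are illustrative and do not come from the talk.

```python
BALMER_CONSTANT = 3645.6  # Balmer's empirical constant, in units of 10^-7 mm

def balmer_wavelength(m, n=2):
    """Wavelength from Balmer's formula: constant * m^2 / (m^2 - n^2)."""
    if m <= n:
        raise ValueError("m must be an integer greater than n")
    return BALMER_CONSTANT * m**2 / (m**2 - n**2)

# m = 3..6 are the four lines Balmer fitted; m = 7 is his predicted fifth line.
for m in range(3, 8):
    print(m, round(balmer_wavelength(m), 2))
```

For m = 7 the formula gives 49/45 × 3645.6, about 3969.65 in these units. The formula was fitted to data alone, decades before quantum mechanics explained it, which is the point of "physical theory from data".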
Brief History of Modern S&T Databases
1950s
• Crystal structures (software generated data)
1960s
• Neutron data (modeling weapons)
1970s
• Analytical chemistry (identify chemicals)
• Thermochemistry (properties linked)
• Environmental and toxicology
• Large physics experiments
• Space science
1980s
• Astronomy
• Materials
• Earth sciences
• Biology
• Genomics
Scientific Databases Today
“Preserving” Data is Easy
• Database management tools are inexpensive and powerful
• Many models for good interfaces exist
• Collecting data (data deposition) can be routine
• Expertise is easily available from many sources
Building databases today is remarkably easy
Comprehensive Data Collections for 21st Century Science
• International Virtual Observatory: All observation for every point in the sky
• Structural Genomics: For all living things!
• Proteomics: 30,000 or 300,000?
• Climate change: Water, earth, atmosphere and all they contain
• Historic geologic: Many millennia, the entire planet
• Chemistry on demand: 60 elements, 5 at a time, many ratios, 10^9 – 10^10 compounds
• Biodiversity: 5M species? or 10M? or 50M?
• Brain scans: Every person, every thought forever
Very large databases will be found in every scientific discipline
The Face of 21st Century Science
• Complex
• Multi-disciplinary
• Real systems
• Virtual as well as physical
Access to quality data becomes critical
Attention to the problems and challenges of long term preservation of and access to data becomes more important than ever!
Scientific Discovery and Data Collections: The Paradigm has Changed
Yesterday
• Collections managed by a small number of people
• Collections readable by one scientist
• Collections interpretable by one person
• Discoveries made by thinking, with analysis by one person
Today
• Collections managed by groups
• Collections not readable by any individual
• Collections interpretable only with aid of software
The Future
• Discoveries aided or made by computers, with verification by people?
The Proposition
Scientific databases in the future will be an even more important source for scientific discovery
• Data collections are critical for
  – New insights
  – New scientific principles
  – New knowledge
  – Understanding complex systems
Let’s look at 3 problems and the challenges they present
My Talk
1. Science today
2. The Data Revolution in science
3. Scientific data and scientific databases
4. Data and scientific discovery
5. The challenges of using data science on scientific data
Three Problems in the Data Era
1. Too much data
2. Complex systems
3. Complex science
Scientific Knowledge
Theory Models
Hypotheses Questions
Data
Measurement
Data Collections
Science and Data: How the information revolution has changed their relationship
Scientific Data in the Future
Problem 1: Too much data
The Challenges
• How do you look at large volumes of data?
• What does data quality mean for large data collections?
• How do you determine which data are important?
Challenge: Too Much Data
• Too much data for any one person to read or understand
Can use
• Visualization
• Data reduction
• Anomalies and outliers
How does anyone read a terabyte of data?
Software must be used to “read” data
Can we allow software to determine what are important data?
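One way software can "read" data for a person is by flagging anomalies. The sketch below is only an illustration, not a method proposed in the talk: it assumes a one-dimensional list of values and uses the modified z-score based on the median absolute deviation (MAD), with the conventional 0.6745 scaling and a 3.5 cutoff. The function name `flag_outliers` is hypothetical.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > threshold]

# Hypothetical readings with one value far from the rest.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 55.0]
print(flag_outliers(readings))
```

The threshold is itself a judgment about what counts as important, which is exactly the question the slide raises: an outlier may be bad data or the start of a discovery.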
Challenge: Too Much Data
Do we have the technology to handle the overwhelming volume of data from new measurement techniques?
• What to capture when we generate too much data too fast?
• How to store, represent, manipulate and display too voluminous data?
• How to find out which data are important?
Challenge: Too Much Data and Data Quality
Evaluating data quality
• How can large amounts of data be evaluated? In real time? As new data are published?
• How can large data sets be integrated together correctly?
• What does quality mean in a terabyte of data? For each data point? Each set of points? Sub-collections? An entire collection?
Challenge: Too Much Data and Data Quality
Evaluating data quality
• Bad data quality leads to bad science and bad decisions based on science
• One measurement does not make a property
• Agreement between theory and experiment does not mean both are correct
• In today’s world with terabytes of data, what does quality mean?
Challenge: Modeling and Data Quality
Data Quality
Making accurate virtual measurements on virtual systems
• What is the quality in a calculation?
• How do you establish uncertainty for a calculation?
• Which computational results should be stored, and how can those data be handled?
• How do you discover something new in a mass of computational results?
Some models have mechanisms for assessing quality
HΨ = EΨ Schrödinger equation
The variational principle applies only to energy
Quality of other properties calculated from the equation is unknown
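For reference, the variational principle mentioned above can be written out. This is the standard textbook statement; the trial-wavefunction notation is an assumption added here and is not on the slide.

```latex
\[
E[\Psi_{\mathrm{trial}}]
  = \frac{\langle \Psi_{\mathrm{trial}} \,|\, H \,|\, \Psi_{\mathrm{trial}} \rangle}
         {\langle \Psi_{\mathrm{trial}} \,|\, \Psi_{\mathrm{trial}} \rangle}
  \;\ge\; E_0
\]
```

Here E_0 is the exact ground-state energy. A lower computed energy is therefore a better one, which gives a built-in quality measure for the energy, but the inequality says nothing comparable about other properties computed from the same trial wavefunction.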
Challenge: Modeling and Data Quality
Documenting Quality of a calculation
Making accurate virtual measurements on virtual systems
• How do you establish uncertainty for each step of a calculational result?
• Science Vol. 336 pp. 159-160 (2012): Software created by public funding must be released, just as with data themselves
1. Model assumptions (which independent variables are used)
2. Translation into algorithms
3. Coding
4. Input
5. Finite arithmetic
6. Post-processing analysis
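The "finite arithmetic" step is easy to demonstrate. As a small sketch, assuming ordinary double-precision floats, adding 0.1 ten times in a plain loop does not give exactly 1.0, while Python's `math.fsum` tracks the lost low-order bits and does. The helper name `naive_sum` is illustrative.

```python
import math

def naive_sum(values):
    """Add values left to right, accumulating rounding error at each step."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1] * 10
naive_total = naive_sum(values)
exact_total = math.fsum(values)  # compensates for rounding in partial sums

print(naive_total == 1.0)
print(exact_total == 1.0)
print(abs(naive_total - exact_total))
```

The discrepancy here is tiny, but in a long simulation such errors accumulate over many steps, so the arithmetic used is part of what has to be documented when reporting the quality of a calculation.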
Challenge: With Too Much Data, What is Important?
When you have a lot of data, what can you do with it?
Abstraction of important features
• How can we find what is important when we have too much data?
• Or not enough of the correct data?
Truly great science is having the insight of what is important
Can we teach software how to do that?
• 80 trillion cells in body
• Number of human proteins is estimated to be 30,000 – 70,000
• Which are important and why?
• At least 150 proteins repair DNA damage
Scientific Data in the Future
Problem 2: Real systems are very complex
The Challenges
• Large number of objects
• Large number of independent variables
• Changing scientific language
• Data Integration
Complexity Challenge: Many Objects
There are too many objects to count, observe, measure, or calculate
• Number of stars
• Number of species
• Number of chemicals
• Number of individuals
• Number of rocks
• Number of cells
• Number of thoughts
• Number of ecosystems
• You get the point
Complexity Challenge: Independent Variables
• How do we use metadata (report the relevant independent variables) to describe what we preserve?
• Independent variables are a quantitative mechanism for expressing our knowledge about how and why a phenomenon occurs
• Capturing complete knowledge of independent variables requires a large or (perhaps) even an impossible amount of data
• One goal of research is to understand which variables are important and why
• Our knowledge clearly evolves over time
P=(n/V)RT (Ideal gas law)
Dependent variable P=pressure
Independent variables n/V = number/volume=density T = temperature
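The ideal gas law above shows the dependent/independent split directly in code: the independent variables are function arguments and the dependent variable is the return value. This is a minimal sketch assuming SI units (moles, cubic metres, kelvin, pascals); the function and parameter names are illustrative.

```python
R = 8.314462618  # molar gas constant, J/(mol*K)

def pressure(n_moles, volume_m3, temperature_k):
    """Ideal gas law P = (n/V) R T, returning pressure in pascals."""
    if volume_m3 <= 0 or temperature_k <= 0:
        raise ValueError("volume and temperature must be positive")
    density = n_moles / volume_m3  # n/V, the first independent variable
    return density * R * temperature_k

# One mole in 22.414 litres at 273.15 K is close to one standard atmosphere.
print(round(pressure(1.0, 0.022414, 273.15)))
```

A stored pressure value is only reusable if n/V and T are recorded with it, which is why the independent variables belong in the metadata.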
Complexity Challenge: Independent Variables
Major challenge in data collections is to capture evolution of knowledge of independent variables
• Must be done in such a way as to preserve data set compatibility
• Let’s work through a quick example of the complexity and how knowledge changes with time
Most data are functions of numerous independent variables
Brain Imaging
• Recording techniques evolve and improve over time
  – X-ray, CT, MRI, PET, next?
• Each technology individually evolves, as do the types of signals collected, their association with brain activity and region
• Monitoring reactions to stimulus: pain, visual, auditory, tactile, etc.
• Details must be defined and recorded
Brain History
• If we imagine the details necessary to describe this, the number of independent variables expands rapidly
  – Stimuli history
  – Physiological history
  – Developmental history
  – Environmental exposures
  – Drugs taken
  – More
• As with the development of unifying theories of the gross physical world (motion, evolution, chemistry, genetics), the details are necessary to find the dominant factors
What are the most important independent variables for recording brain history? Still an open question!
Complexity Challenge: Independent Variables
Modern science requires data from many disciplines
• If we must aggregate different data sets (e.g., over the Web) to do discovery, how do we know data are comparable?
• How do we integrate data sets with varying numbers of independent variables?
• Especially if their names and meaning change over time?
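The integration questions above can be illustrated with a toy sketch. It assumes records are simple dictionaries, that a hand-maintained table maps older variable names to current ones, and that a variable a record never reported is stored as `None` rather than guessed. All names here (`RENAMED`, `harmonize`, `integrate`, and the variable names) are hypothetical.

```python
# Hypothetical mapping from older variable names to current ones.
RENAMED = {"temp": "temperature_K", "press": "pressure_Pa"}

def harmonize(record):
    """Rename legacy variable names to the current vocabulary."""
    return {RENAMED.get(name, name): value for name, value in record.items()}

def integrate(datasets):
    """Merge records from several data sets onto the union of their variables.

    Variables a record never reported are set to None rather than guessed.
    """
    records = [harmonize(r) for ds in datasets for r in ds]
    variables = sorted({name for r in records for name in r})
    return variables, [{v: r.get(v) for v in variables} for r in records]

old = [{"temp": 300.0, "press": 101325.0}]
new = [{"temperature_K": 310.0, "pressure_Pa": 99000.0, "humidity": 0.4}]
variables, merged = integrate([old, new])
print(variables)
print(merged)
```

Renaming is the easy part. The sketch cannot tell whether the old data were taken at a humidity that mattered, or whether "temp" meant the same thing when it was recorded, and that is the harder comparability problem.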
Complexity Challenge: Evolving Language
These are powerful change factors that cannot be ignored in preserving data
How do languages evolve?
• Contractions of words
• Reordering sentences
• Borrowing words
• Dropping and adding startings and endings
• Differentiation of concepts
• Evolution of concepts
John McWhorter – The Power of Babel
• Ontologies can help
Complexity Challenge: Evolving Language
Time and Scientific Language
• Grammar rules appeared only a few hundred years ago
• Language change factors ignore authority
• Usage wins over regulations every time!
  – Are terminology standards actually used?
Are efforts such as that on the right doomed to fail?
Complexity Challenge: Evolving Language
Time and the evolution of Scientific Language
• New knowledge requires new language
• Data preservation efforts must recognize evolution of scientific language
• Not just independent variables and metadata – the scientific language itself
• So if you are going to do “discovery,” you’d better know what you are working with
Complexity Challenge: Data Integration
Developing standards for scientific data and metadata
• What is the business case for such standards?
• How can you standardize scientific language if it continues to evolve?
• How can you determine object equivalency and uniqueness with partial data sets?
• How do you persuade scientists to back off the state-of-the-art to agree on standards?
Complexity Challenge: Data Integration
Data Standards: Making exploitation of large data sets possible
• What standards are needed for making data sets work together?
• How can you trust integrated data sets?
• No science is an island by itself
• Science today is multi-discipline, international, multi-lingual, ever-changing
• Integration can be achieved by standards and clear reporting of measurements
• As knowledge of variables increases, integrating old and new data becomes more difficult
Complexity Challenge: Data Integration
Data Ownership
We must differentiate between discovery and adding value
Observing nature should not lead to data ownership
• Transforming observations through value-added intellectual effort can create IPR
• For scientific data, must be very careful not to restrict use by others
• The same observations led to many different theories of planetary motion – from Aristotle and Ptolemy to Kepler to Newton to today
Complexity Challenge: Data Integration
Maintaining full and open access to the large number of databases required for making new scientific discoveries
• What policies are needed for full and open access?
• Open access aims to provide everyone with the information and data to advance science
• Open is not necessarily free
• Long term preservation does cost money
  – Data and literature collections must be supported
• How can discoverers profit from their automated discoveries?
• How do you get the information industry to understand the new paradigm for discovery?
Complexity Challenge: Data Integration
Data Costs
Nothing is ever really free!
• It costs significant money to generate, capture, manage, store, analyze, use, disseminate, and preserve scientific data
• Data costs must be integrated into the cost of generation
Policy and practice will vary from discipline to discipline, but nothing is ever free
Complexity Challenge: Data Integration
Data Repositories
• Started with crystal structure
• Genomics
• Other disciplines following
• NSF now requiring data management plans
• Often required to publish papers
• Curation (everything reported correctly) now automated
Model does not translate easily for evolving fields
Complexity Challenge: Data Integration
Progression of Data Collection
• Individual
• Collegial
• Institution or discipline repository
• Evaluated data
• “Property values”
• Each step requires more metadata to provide adequate documentation
• Very difficult to add metadata after the fact
• For a new phenomenon, it is difficult to know which independent variables are necessary
Scientific Discovery in Preserved Data
Problem 3: Real systems are very complex and complex behavior in systems is difficult to find
The Challenges
• How do we recognize real understanding?
• What is knowledge discovery in the future?
Challenge: Real Understanding
Real systems are very complex
How can you identify the existence of a unifying theory or concept?
• Could we have derived quantum mechanics from a complete database of atomic and molecular spectra?
• What features does quantum mechanics have beyond these data?
Challenge: Real Understanding
Real systems are very complex
Multiple views of the same phenomena exist
The Simple (?) Laws of Interaction
• String theory
• Quantum theory
• Matrix mechanics
• Maxwell’s theory
• Quantum electrodynamics
• Newton’s laws of motion
Are all views of nature equally discoverable?
By computer-aided discovery?
Challenge: Real Understanding
How do we develop real understanding?
Real Scientific knowledge?
• Just because we measure a phenomenon does not mean we understand it
  – Do we know how many genes there are?
  – Does measuring the mass of the universe make us understand dark matter?
How do data lead to understanding?
Challenge: Real Understanding
Knowledge Discovery
• Large amounts of data can help make new discoveries
• How to know which data are the most important, the key to discovery?
• How to know something is there to be discovered?
• Can too much data make discovery more difficult?
• Will discovery have to be automated? Can it be?
Data Collections in 21st Century Science
The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them
William Bragg
Some Final Thoughts
Scientific databases in the future will be an even more important source for scientific discovery
• Preservation of data needed for
  – New insights
  – Scientific principles
  – New knowledge
  – Understanding complex systems
The problems and challenges I have just outlined are not insurmountable – just problems and challenges
Some Final Thoughts
Science has changed and with that change, our expectations for science have changed.
We now expect science to be a force for shaping the future, not just understanding nature
Scientific databases in the future will be an even more important source for scientific discovery
The Data Revolution has become an enabling force to meet our expectations for 21st Century Science