MITIGATING DISASTERS WITH DEEP LEARNING FROM TWITTER?

Deep Learning pipeline implemented using the Machine Learning Library (MLlib) from Apache Spark, scaled onto a converged Big Data/HPC cluster via Univa Universal Resource Broker for featurization, training, evaluation and operational use.

WWW.UNIVA.COM
Copyright © 2017 Univa® and Grid Engine® are registered trademarks of Univa Corporation

MOTIVATION
• Uncertainty is inevitable in any attempt to predict earthquakes, as the ultimate cause (mantle convection) is inherently non-deterministic - in situ measurements, however, may reduce this uncertainty
• The timely availability of actionable observations is critical to the effective and efficient communication of advisories and warnings
• The establishment of a cause (earthquake) - effect (tsunami) relationship remains outstanding, and is complicated by multiple factors (e.g., tectonic setting)
• Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (computed in real time), however, have proven to be extremely accurate - results successfully combine data from a distributed array of deep-ocean tsunami detection buoys with a forecasting model

April 16, 2016 (Japan Meteorological Agency)
• 01:25 JST: Magnitude 7.1 earthquake, Kyushu area, 10 km depth
• 01:27 JST: Tsunami advisories issued (imminent arrival; ~1 m maximum height)
• 01:29 JST: High-tide amplification advisory
• 04:43 - 04:54 JST: High-tide times
• 01:40 JST: Estimated first arrival of tsunami
• 02:14 JST: Tsunami advisories lifted

THE 6Vs OF SCIENTIFIC VS. SOCIAL NETWORKING DATA
• Volume: Traditional Scientific Data is small'ish, finite; Twitter Data is BIG, ‘infinite’
• Variety: Traditional Scientific Data is semi-structured, restricted; Twitter Data is unstructured, unrestricted - except for handles, hashtags & URLs (pages, images)
• Velocity: Traditional Scientific Data is slow, sampled; Twitter Data is fast, streamed
• Veracity: biases, noise & abnormalities
• Validity: accuracy & correctness
• Volatility: Traditional Scientific Data is low (stationary, irreplaceable); Twitter Data is high? (mobile? disposable?)

Sample Twitter Data:
• Created at: Wed Jun 04 20:29:33 +0000 2014
  5.0 earthquake! Thu Jun 05 02:04:27 GMT+09:00 2014 near 84km SW of Iquique, Chile http://t.co/mmFokGQWT7 #earthquake
• Created at: Wed Jun 04 20:30:13 +0000 2014
  The #earthquake continues: Latest via @Spectator_CH /@YouGov -#Labour 36 #Tories 32%, LD 8%, #Ukip 14%. Implied Labour majority- 42 .
• Created at: Wed Jun 04 20:31:35 +0000 2014
  #terremoto ML 2.7 CENTRAL ITALY: Magnitude ML 2.7 Region CENTRAL ITALY Date time 2014-06-04 20:01:33.9 UTC... http://t.co/Y141Ovu6kP
• Created at: Tue Jun 10 12:22:34 +0000 2014
  RT @TheRock: Just wrapped a massive post earthquake scene for SAN ANDREAS. To the hundreds of background actors/extras.. THANK U for all yo...

CONCLUSIONS
• Credible tweets could be transformative - a Big Data source that can complement traditional sources (e.g., scientific instruments)
• Working with 6V Twitter data can be challenging, though it also presents interesting opportunities
• Curation of training data is extremely important, but also extremely time consuming (as this is a manual process)
• Current research emphasizes Deep Learning, BUT RDF/OWL semantics will ultimately need to play a role
• The approach can be generalized for application to natural and anthropogenic disasters of all kinds

ACCOUNTING FOR OIL SPILLS AND OTHER DISASTERS …
• Energy exploration via reflection seismology provides the fundamental source of data that is subsequently processed and interpreted for the identification of potential petroleum reservoirs
• Reservoir simulation is used to engineer the extraction of petroleum reserves from reservoirs
• Drilling is used to ‘truth’ the results provided by interpretations and simulations prior to production extraction
• SOPs ensure extraction of oil from a production reservoir is routinely monitored and reported upon - e.g., to quantify rig safety and output (barrels/day)
• From exploration to extraction, this is a data-rich workflow
• Additional data sources become relevant when disasters occur (e.g., oil spills) - from re-purposed scientific instruments (e.g., weather satellites) to social media (e.g., Twitter, Instagram, Snapchat, ...)
• Data-rich workflows can generate problems in Big Data Analytics

Deep Learning Pipeline
• Data extracted from Twitter via a Perl script that targeted the hashtag #earthquake
• Training Data: Twitter data manually curated into ‘ham’ and ‘spam’, then represented in-memory via Spark RDDs
• Featurization: Spark MLlib HashingTF establishes frequency-based usage
• Training: Spark MLlib Logistic Regression with SGD classifies spam vs. ham
• Model Evaluation: recent ‘earthquake’ data from Twitter used to evaluate the model

[Figure: pipeline stages 1-6: Training Data (SPAM, HAM), Featurization, Feature Vectors, Training, Model, Model Evaluation, Best Model]
[Figure: Univa software stack: 1000+ Apps, Data Sources, Univa Universal Resource Broker, Univa Grid Engine, Scheduler, API, Command Line, Spark UIs, Data Frames, ML Pipelines, MLlib, GraphX, Spark Streaming, Spark Core, Spark SQL]

THE OPPORTUNITY FOR SEMANTICS
• A feature vector is a feature vector - it is devoid of semantics
• Deep Learning ignores the inherent, overall credibility of a Tweet - e.g., as quantified by TweetCred
• Twitter metadata (handles, hashtags and URLs) is weighted the same as Twitter data (the unstructured text that comprises the body of a Tweet) in constructing feature vectors - i.e., the semantic value of Twitter metadata is also ignored by Deep Learning
• The W3C’s Resource Description Framework (RDF) facilitates the representation of metadata and thus exposes semantics
• The W3C’s Web Ontology Language (OWL) accounts for domain specifics - it disambiguates use of overloaded terms (e.g., “earthquake”) in different contexts (e.g., geophysics vs. movies vs. …)
• Deep Learning in combination with RDF/OWL semantics has the potential to produce learned models with knowledge represented
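
The featurization and training stages of the Deep Learning pipeline described on this poster (hashed term frequencies feeding a logistic regression trained with SGD to separate spam from ham) can be illustrated with a minimal sketch. This is a pure-Python stand-in written for illustration; it is not the Spark MLlib code behind the poster. The names `hashing_tf`, `train_sgd`, `predict_proba`, `NUM_FEATURES` and `LABELED_TWEETS`, the CRC32 hash, the tokenizer, and the learning rate and epoch count are all assumptions. The four tweets are abridged from the poster's samples, and their ham/spam labels are assumed here rather than taken from the curated training set.

```python
import math
import re
import zlib

NUM_FEATURES = 1024  # hypothetical number of hash buckets

# Abridged from the poster's sample tweets; the labels (1.0 = ham, 0.0 = spam)
# are an assumption made for this sketch.
LABELED_TWEETS = [
    ("5.0 earthquake! near 84km SW of Iquique, Chile #earthquake", 1.0),
    ("The #earthquake continues: Latest via @Spectator_CH /@YouGov "
     "-#Labour 36 #Tories 32%, LD 8%, #Ukip 14%.", 0.0),
    ("#terremoto ML 2.7 CENTRAL ITALY: Magnitude ML 2.7 Region CENTRAL ITALY", 1.0),
    ("RT @TheRock: Just wrapped a massive post earthquake scene "
     "for SAN ANDREAS.", 0.0),
]


def tokenize(text):
    """Lower-case the text and keep words, #hashtags and @handles as tokens."""
    return re.findall(r"[#@]?\w+", text.lower())


def hashing_tf(text, num_features=NUM_FEATURES):
    """Hashed term frequencies: a sparse {bucket index: count} mapping."""
    vec = {}
    for token in tokenize(text):
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec


def predict_proba(weights, bias, vec):
    """Logistic-regression probability that a feature vector is ham."""
    z = bias + sum(weights[i] * count for i, count in vec.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, epochs=200, learning_rate=0.1,
              num_features=NUM_FEATURES):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    weights = [0.0] * num_features
    bias = 0.0
    for _ in range(epochs):
        for vec, label in examples:
            error = predict_proba(weights, bias, vec) - label
            for i, count in vec.items():
                weights[i] -= learning_rate * error * count
            bias -= learning_rate * error
    return weights, bias


if __name__ == "__main__":
    examples = [(hashing_tf(text), label) for text, label in LABELED_TWEETS]
    weights, bias = train_sgd(examples)
    for text, label in LABELED_TWEETS:
        p_ham = predict_proba(weights, bias, hashing_tf(text))
        print(f"label={label:.0f} p(ham)={p_ham:.3f} {text[:40]}")
```

Note that each hashed vector records only bucket counts: a hashtag, a handle and a body word are indistinguishable once hashed, which is the sense in which a feature vector is devoid of semantics.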