Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications

Post on 07-May-2015

DESCRIPTION

Presentation at the AAAI 2013 Fall Symposium on Semantics for Big Data, Arlington, Virginia, November 15-17, 2013

Additional related material at: http://wiki.knoesis.org/index.php/Smart_Data

Related paper at: http://www.knoesis.org/library/resource.php?id=1903

Abstract: We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the five V's of Big Data, where four of the Vs are harnessed to produce the fifth V - value. To handle the challenge of Volume, we advocate semantic perception that can convert low-level observational data to higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we resort to the use of semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics capability to dynamically create event or situation specific models and recognize new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. The above four Vs of Big Data are harnessed by the semantics-empowered analytics to derive Value for supporting practical applications transcending physical-cyber-social continuum.

Transcript

Semantics-empowered Big Data Processing for PCS Applications

Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing

Wright State University, Dayton, OH-45435

Prasad 2

Outline

• 5 V’s of Big Data Research

• Semantic Perception for Scalability

• Lightweight semantics to manage heterogeneity
– Cost-benefit trade-off and continuum

• Hybrid Knowledge Representation and Reasoning
– Anomaly, Correlation, Causation

11/15/2013

5V’s of Big Data Research

Volume

Velocity

Variety

Veracity

Value

Big Data => Smart Data

Volume : Assorted Examples

• 25+ billion sensors have been deployed.

• About 250 TB of sensor data are generated for a NY-LA flight on a Boeing 737.

• A Parkinson's disease dataset that tracked 16 patients with mobile phones using 7 sensors over 8 weeks is 12 GB.

Check engine light analogy

Volume : Semantic Perception

• Abstracting machine-sensed data
– E.g., fine-grained to coarse-grained
– E.g., average, peak, rate of change

• Extracting human-comprehensible features/entities

• Machine perception
– Derive conclusions using domain models and hybrid abductive/deductive reasoning

Goal: Human accessible situational awareness and actionable intelligence for decision making
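
The abstraction step (fine-grained readings reduced to an average, a peak, and a rate of change) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `WindowSummary` and `summarize_window` and the choice of a single fixed window are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WindowSummary:
    """Coarse-grained abstraction of one window of raw sensor readings."""
    average: float
    peak: float
    rate_of_change: float  # value units per time unit, first to last sample


def summarize_window(timestamps: Sequence[float],
                     values: Sequence[float]) -> WindowSummary:
    """Reduce a window of (timestamp, value) samples to three summary numbers."""
    if len(timestamps) != len(values) or len(values) < 2:
        raise ValueError("need at least two aligned samples")
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return WindowSummary(
        average=sum(values) / len(values),
        peak=max(values),
        # Average slope over the window (hypothetical choice of definition).
        rate_of_change=(values[-1] - values[0]) / elapsed,
    )
```

For example, `summarize_window([0, 1, 2, 3], [70, 72, 75, 79])` yields an average of 74.0, a peak of 79, and a rate of change of 3.0 per time unit, so four raw readings collapse into one record that a downstream reasoner can consume.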

Weather Use Case

• Machine-sensed phenomenon
– temperature, precipitation, humidity, wind speed, etc.

• Human perceived features
– blizzard, flurry, rain storm, clear, etc.
– categories of hurricanes (SSHWS)

• Machine perception
– Using domain models from NOAA

• Ultimately, generate weather alerts …
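
As a small illustration of mapping a machine-sensed value to a human-perceived feature, the sketch below looks up the Saffir-Simpson Hurricane Wind Scale (SSHWS) category from a sustained wind speed. The thresholds are the published SSHWS lower bounds in mph; the function names and the alert wording are assumptions, and a real NOAA domain model covers far more than this one threshold table.

```python
from typing import Optional

# Lower bounds (mph, 1-minute sustained wind) of SSHWS categories, highest first.
SSHWS_LOWER_BOUNDS_MPH = [(157, 5), (130, 4), (111, 3), (96, 2), (74, 1)]


def hurricane_category(sustained_wind_mph: float) -> Optional[int]:
    """Return the SSHWS category, or None below hurricane strength."""
    for lower_bound, category in SSHWS_LOWER_BOUNDS_MPH:
        if sustained_wind_mph >= lower_bound:
            return category
    return None


def weather_alert(sustained_wind_mph: float) -> str:
    """Turn the perceived feature into a short alert (hypothetical wording)."""
    category = hurricane_category(sustained_wind_mph)
    if category is None:
        return "no hurricane-force winds"
    return f"Category {category} hurricane"
```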

Parkinson’s Disease Use Case

• Data from machine-sensors
– accelerometer, GPS, compass, microphone, etc.

• Human perceived features
– tremors, walking style, balance, slurred speech, etc.

• Machine perception
– Using domain models to be created to diagnose and monitor disease progression

• Ultimately, recommend options to control chronic conditions …

Heart Failure Use Case

• Machine-sensed data
– weight, heart rate, blood pressure, oxygen level, etc.

• Human perceived features
– Risk-level for hospital readmission of CHF/ADHF patient

• Machine perception
– Using domain models to be created to monitor heart condition of a cardiac patient post hospital discharge

• Ultimately, recommend treatments to reduce preventable hospital readmissions …

Asthma Use Case

• Data from machine-sensors
– Environmental sensors, physiological sensors, etc.

• Human perceived features
– Asthma severity gleaned from frequency of asthma attacks, wheezing, coughing, sleeplessness, etc.

• Machine perception
– Using domain models to be created to monitor asthma patients and their surroundings

• Ultimately, recommend prevention and control options …

Traffic Use Case

• Data from machine-sensors, social media stream, and planned event schedules
– Traffic flow sensors: link speed, link volume, Event-specific tweets, etc.

• Human perceived features
– traffic delays and congestion, etc.

• Machine perception
– Using domain models to be created to understand traffic patterns in response to events

• Ultimately, recommend traffic management options …

Heterogeneity in a Physical-Cyber-Social System

[Figure: 511.org traffic monitoring and schedule information, showing slow moving traffic, link descriptions, and scheduled events]

Volume with a Twist

Resource-constrained reasoning on mobile-devices

Goal: Boolean encodings to ensure feasibility, efficiency, and economy

Perception Cycle* that exploits background knowledge / domain models

* based on Neisser’s cognitive model of perception

[Figure: cycle linking Observe Property and Perceive Feature through (1) Explanation and (2) Discrimination, informed by Prior Knowledge]

• Explanation: abstracting raw data for human comprehension

• Discrimination: focus generation for disambiguation and action (incl. human in the loop)
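
The explanation and discrimination steps of the perception cycle can be sketched with bit-vector (Boolean) encodings, where each property is one bit and each feature is a mask of the properties it can exhibit. This is a simplified sketch under stated assumptions: the toy weather knowledge base, the property names, and the functions `explain` and `discriminate` are hypothetical, and the slides do not give the actual encoding.

```python
from typing import Dict, Iterable, List

PROPERTIES = ["freezing_temperature", "high_wind", "snow", "rain"]
BIT = {name: 1 << i for i, name in enumerate(PROPERTIES)}


def encode(names: Iterable[str]) -> int:
    """Pack a set of property names into an integer bit mask."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


def decode(mask: int) -> List[str]:
    """Unpack a bit mask into property names, in PROPERTIES order."""
    return [name for name in PROPERTIES if mask & BIT[name]]


# Hypothetical prior knowledge: feature -> properties it can exhibit.
PRIOR_KNOWLEDGE: Dict[str, int] = {
    "blizzard": encode(["freezing_temperature", "high_wind", "snow"]),
    "flurry": encode(["freezing_temperature", "snow"]),
    "rain_storm": encode(["high_wind", "rain"]),
}


def explain(observed: int) -> List[str]:
    """Explanation: features whose properties cover every observation."""
    return [feature for feature, mask in PRIOR_KNOWLEDGE.items()
            if (mask & observed) == observed]


def discriminate(observed: int, candidates: List[str]) -> List[str]:
    """Discrimination: unobserved properties that some, but not all,
    candidate features exhibit, i.e. the ones worth observing next."""
    if not candidates:
        return []
    union, intersection = 0, ~0
    for feature in candidates:
        union |= PRIOR_KNOWLEDGE[feature]
        intersection &= PRIOR_KNOWLEDGE[feature]
    return decode(union & ~intersection & ~observed)
```

With only `snow` observed, `explain` returns `blizzard` and `flurry`, and `discriminate` suggests checking `high_wind`. Each step is a handful of bitwise operations per feature, which is the kind of economy that matters on a resource-constrained device.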

Virtues of Our Approach to Semantic Perception

Blends simplicity, effectiveness, and scalability.

• Declarative specification of explanation and discrimination;

• With applications (e.g., to healthcare) that are of contemporary relevance and interdisciplinary;

• Using encodings/algorithms that are significant (asymptotic order of magnitude gain) and necessary (“tractable” due to time/memory reduction for typical problem sizes); and

• Prototyped using extant PCs and mobile devices.

Efficiency Improvement

• Problem size increased from 10’s to 1000’s of nodes

• Time reduced from minutes to milliseconds

• Complexity growth reduced from polynomial, O(n^3) < x < O(n^4), to linear, O(n)

Evaluation on a mobile device
Volume and Velocity

• Lightweight semantics-based Adaptive/Continuous Filtering
– E.g., track the evolution of crowd-sourced and verified Wikipedia event pages for relevance ranking of Twitter hashtags in a disaster response use case

• Building domain models dynamically
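
One way to picture adaptive filtering against an evolving event page is to re-score hashtags each time the page text changes. The sketch below is an assumption-laden toy, not the method used in the talk: it scores a hashtag by the fraction of words in its tweets that also appear in the current page snapshot, and the names `terms` and `rank_hashtags` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def terms(text: str) -> List[str]:
    """Lowercased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_hashtags(page_text: str,
                  tweets: Iterable[str]) -> List[Tuple[str, float]]:
    """Rank hashtags by word overlap between their tweets and the page snapshot."""
    page_vocab = set(terms(page_text))
    hits: Counter = Counter()
    totals: Counter = Counter()
    for tweet in tweets:
        hashtags = set(re.findall(r"#(\w+)", tweet.lower()))
        # Score only the non-hashtag words of the tweet.
        words = terms(re.sub(r"#\w+", " ", tweet))
        for tag in hashtags:
            totals[tag] += len(words)
            hits[tag] += sum(1 for word in words if word in page_vocab)
    scores = {tag: hits[tag] / totals[tag] for tag in totals if totals[tag]}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `rank_hashtags` again with a newer snapshot of the page re-ranks the same tweets, so a hashtag that was irrelevant early in an event can rise once the page starts mentioning its vocabulary.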

Dynamic Model Creation: Continuous Semantics

[Figure: example of a dynamically acquired fact, "Heliopolis is a suburb of Cairo."]

Variety

Syntactic and semantic heterogeneity
• in textual and sensor data,
• in (legacy) materials data
• in (long tail) geosciences data

Idea: Semantics-empowered integration

Variety (What?): Materials/Geosciences Use Case

• Structured Data (e.g., relational)

• Semi-structured, Heterogeneous Documents (e.g., Publications and technical specs, which usually include text, numerics, maps and images)

• Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)

Variety (How?/Why?): Granularity of Semantics & Applications

• Lightweight semantics: File and document-level annotation to enable discovery and sharing

• Richer semantics: Data-level annotation and extraction for semantic search and summarization

• Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Data

Cost-benefit trade-off and continuum

Challenges Associated with Typical Spreadsheet/Table

• Meant for human consumption

• Irregular
– Not simple rectangular grid

• Heterogeneous
– All rows not interpreted similarly

• Complex
– Meaning of each row and each column context dependent

• Footnotes modify meaning of entries (esp. in materials and process specifications)

Practical Semi-Automatic Content Extraction

• DESIGN: Develop regular data structures that can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential ambiguities
– Enable automatic translation

• USE: Manual population of regular tables and automatic translation into LOD
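
The "regular table in, Linked Open Data out" step can be sketched as a translation from manually populated rows to N-Triples lines. Everything specific here is an assumption for illustration: the `http://example.org/materials/` namespace, the column names, and the function `rows_to_ntriples`; the sketch also assumes key and column values are already safe to embed in an IRI.

```python
from typing import Dict, List

BASE = "http://example.org/materials/"  # hypothetical namespace, not from the talk


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(rows: List[Dict[str, str]], key_column: str) -> List[str]:
    """Translate each non-empty cell of a regular table into one triple.

    The key column names the subject; every other column becomes a predicate.
    """
    triples = []
    for row in rows:
        subject = f"<{BASE}{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or value == "":
                continue  # skip the key itself and empty cells
            predicate = f"<{BASE}property/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples
```

Because each row has the same interpretation and each column a fixed meaning, the translation needs no per-table logic, which is the point of populating a regular structure by hand first.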

Variety (What?) : Sensor Data Use Case

Develop/learn domain models to exploit complementary and corroborative information

• To relate patterns in multimodal data to “situation”

• To integrate machine sensed and human sensed data

Variety: Hybrid KRR

Blending data-driven models with declarative knowledge
– Data-driven: Bottom-up, correlation-based, statistical
– Declarative: Top-down, causal/taxonomical, logical
– Refine structure to better estimate parameters

E.g., Traffic Analytics using PGMs + KBs

Variety (Why?): Hybrid KRR

Data can help compensate for our overconfidence in our own intuitions and reduce the extent to which our desires distort our perceptions.

-- David Brooks of New York Times

However, inferred correlations require clear justification that they are not coincidental, to inspire confidence.

Correlations vs Causation vs Anomalies

• Correlations due to common cause or origin

• Coincidental due to data skew or misrepresentation

• Coincidental new discovery

• Strong correlation vs causation

• Anomalous and accidental

• Correlation turning into causations

• Correlations due to common cause or origin
– E.g., Planets: Copernicus > Kepler > Newton > Einstein

• Coincidental due to data skew or misrepresentation
– E.g., Tall policy claims made by politicians!

• Coincidental new discovery
– E.g., Hurricanes and Strawberry Pop-Tarts Sales

• Strong correlation vs causation
– E.g., Spicy foods vs Helicobacter pylori: Stomach Ulcers

• Anomalous and accidental
– E.g., CO2 levels and Obesity

• Correlation turning into causations
– E.g., Pavlovian learning: conditional reflex

Paradoxes: The Seeds of Progress

Veracity

A lot of existing work on trust ontologies, metrics and models, and on provenance tracking

• Homogeneous data: Statistical techniques

• Heterogeneous data: Semantic models

Open Problem: Develop semantics of trust using expressive frameworks that are both declarative and computational
• To make explicit all aspects that go into trust formation, to inspire confidence in inferences
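
As one example of a statistical technique for homogeneous data, the sketch below keeps a Beta-distribution trust estimate per source, updated from counts of confirmed and refuted observations. It is a commonly used model swapped in for illustration, not necessarily the trust model the authors have in mind; the class name `BetaTrust` and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BetaTrust:
    """Trust in a source as the mean of Beta(positive + 1, negative + 1)."""
    positive: int = 0  # observations later confirmed
    negative: int = 0  # observations later refuted

    def record(self, outcome_ok: bool) -> None:
        """Update the evidence counts with one verified outcome."""
        if outcome_ok:
            self.positive += 1
        else:
            self.negative += 1

    @property
    def expected_trust(self) -> float:
        """Expected trustworthiness; 0.5 when there is no evidence yet."""
        return (self.positive + 1) / (self.positive + self.negative + 2)
```

A source with 8 confirmed and 2 refuted observations has expected trust (8 + 1) / (10 + 2) = 0.75. The counts make the evidence behind the number explicit, but they say nothing about why an observation failed, which is where declarative semantic models would come in.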

Veracity

Machine sensing: objective, quantitative, but prone to environmental effects, battery life, …

Human sensing: subjective, qualitative, but prone to bias, perceptual errors, rumors, …

Open problem: Improving trustworthiness by combining machine sensing and human sensing
– E.g., 2002 Überlingen mid-air collision: pilot incorrectly following traffic controller advice over the electronic TCAS system recommendation

(More on) Value

Learning domain models from “big data” for prediction

E.g., Harnessing Twitter "Big Data" for Automatic Emotion Identification

Idea: Exploit “emotion-hashtagged” tweets as training dataset
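
The idea of using emotion hashtags as free labels can be sketched as a preprocessing step that turns raw tweets into (text, emotion) training pairs. The emotion vocabulary in `EMOTION_HASHTAGS` and the function names are assumptions for illustration; the hashtag is removed from the text so that a classifier trained on the pairs cannot simply read the label.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical emotion vocabulary; the actual label set is not given in the slides.
EMOTION_HASHTAGS = {"#joy": "joy", "#sadness": "sadness",
                    "#anger": "anger", "#fear": "fear"}


def label_tweet(tweet: str) -> Optional[Tuple[str, str]]:
    """Return (text without emotion hashtags, emotion), or None if the tweet
    carries no emotion hashtag or carries conflicting ones."""
    tags = [tag.lower() for tag in re.findall(r"#\w+", tweet)]
    emotions = {EMOTION_HASHTAGS[tag] for tag in tags if tag in EMOTION_HASHTAGS}
    if len(emotions) != 1:
        return None
    # Strip only the emotion hashtags; other hashtags stay as features.
    text = re.sub(
        r"#\w+",
        lambda m: "" if m.group(0).lower() in EMOTION_HASHTAGS else m.group(0),
        tweet,
    )
    return " ".join(text.split()), emotions.pop()


def build_training_set(tweets: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only the tweets that yield an unambiguous label."""
    labeled = (label_tweet(tweet) for tweet in tweets)
    return [pair for pair in labeled if pair is not None]
```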

(More on) Value

Discovering gaps and enriching domain models using data

E.g., Data-driven knowledge acquisition method for domain knowledge enrichment in healthcare

Idea: Use associations between diseases, symptoms and medications in EMR documents
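
A minimal sketch of the gap-discovery idea: count how often a disease co-occurs with a symptom or medication across documents, and surface frequent pairs that the existing domain model does not contain. The sketch assumes entities have already been extracted from each EMR document; the function `candidate_gaps`, the `min_support` threshold, and the placeholder entity names in the usage note are all hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def candidate_gaps(documents: Iterable[Set[str]],
                   diseases: Set[str],
                   related_terms: Set[str],
                   known_pairs: Set[Pair],
                   min_support: int = 2) -> List[Tuple[Pair, int]]:
    """Return (disease, related term) pairs that co-occur in at least
    min_support documents but are missing from the knowledge base."""
    counts: Counter = Counter()
    for entities in documents:
        present_diseases = sorted(entities & diseases)
        present_related = sorted(entities & related_terms)
        counts.update(product(present_diseases, present_related))
    return sorted((pair, n) for pair, n in counts.items()
                  if n >= min_support and pair not in known_pairs)
```

Given four documents in which `disease_a` appears twice with `medication_x` and twice with `symptom_y`, and a knowledge base that already links `disease_a` to `symptom_y`, the only candidate returned is the `disease_a`/`medication_x` pair. Such candidates are suggestions for a domain expert to review, not confirmed knowledge.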

Conclusions

• Glimpse of our research organized around the 5 V’s of Big Data

• Discussed role in harnessing Value
– Semantic Perception (Volume)
– Continuum of Semantic models to manage Heterogeneity (Variety)
– Hybrid KRR: Probabilistic + Logical (Variety)
– Continuous Semantics (Velocity)
– Trust Models (Veracity)

Thank you, and please visit us at http://knoesis.org/

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA

Special Thanks to: Pramod Anantharam and Cory Henson