It’s a Streaming World: Doing Analytics on Data in Motion MARK GREAVES Technical Director, Analytics National Security Directorate Pacific Northwest National Laboratory February 2016 PNNL-SA-116502
AIM NIAC PNNL-SA-116502

Feb 17, 2017

Mark Greaves
Transcript
Page 1: AIM NIAC PNNL-SA-116502

It’s a Streaming World: Doing

Analytics on Data in Motion

MARK GREAVES

Technical Director, Analytics

National Security Directorate

Pacific Northwest National Laboratory

February 2016

PNNL-SA-116502

Page 2: AIM NIAC PNNL-SA-116502

Context

The digital reflection of reality is sharpening thanks to:

the pervasive deployment

of sensors in our cities

the wide adoption of smart

phones (equipped with sensors)

the usage of (location-based)

social networks

the availability of datasets

about urban environment

[source E. Della Valle - http://streamreasoning.org/]

Page 3: AIM NIAC PNNL-SA-116502

What are Data Streams Anyway?

Formally

Data streams are unbounded sequences of time-varying data elements

Less formally

An (almost) “continuous” flow of information

Key Assumptions

Recent information is more relevant because it describes the current state

of a dynamic system

Streams focus on extracting value from transient data consumed on the

fly by continuous queries

[Figure: time axis]

[source E. Della Valle - http://streamreasoning.org/]

Page 4: AIM NIAC PNNL-SA-116502

Leveraging Streams Using Continuous

Queries over Stream Windows

“Official” streams from transportation, utilities, fire, police, parks,

cameras, events, housing, neighborhoods, emergency mgmt…

“Unofficial” streams from businesses, citizen reports, social media…

[Figure: Urban System; input streams, window, Registered Continuous Query, streams of answers]

[graphic E. Della Valle - http://streamreasoning.org/]
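
As a minimal sketch of the idea on this slide, the Python below registers one continuous query over a sliding time window and emits an updated answer for every arriving element. The class name `SlidingWindowQuery`, the 60-second window, and the sample events are assumptions made for illustration; they are not taken from the talk or from any AIM system.

```python
from collections import deque


class SlidingWindowQuery:
    """Hypothetical continuous query: events per source over a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.window = deque()  # (timestamp, source) pairs, oldest first
        self.counts = {}       # source -> number of events inside the window

    def push(self, timestamp, source):
        """Consume one stream element and return the current answer."""
        self.window.append((timestamp, source))
        self.counts[source] = self.counts.get(source, 0) + 1
        # Keep only elements with timestamps in (timestamp - window, timestamp].
        while self.window and self.window[0][0] <= timestamp - self.window_seconds:
            _, old_source = self.window.popleft()
            self.counts[old_source] -= 1
            if self.counts[old_source] == 0:
                del self.counts[old_source]
        return dict(self.counts)


# Illustrative input stream: (seconds, source) pairs in arrival order.
query = SlidingWindowQuery(window_seconds=60)
stream = [(0, "traffic"), (10, "police"), (30, "traffic"), (70, "social")]
answers = [query.push(ts, src) for ts, src in stream]
```

Each call to `push` yields one element of the stream of answers. Elements that slide out of the window are forgotten, which is why summaries that need more history than the window holds have to be handled some other way.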

Page 5: AIM NIAC PNNL-SA-116502

Pros and Cons of Continuous Query

Pros

Robust: Leverages mature database techniques for replication, security,

alerting, performance tuning, view creation, report generation, etc.

Deployable: Relatively cloud-friendly architecture

Predictable: Great for well-defined and/or slowly-changing problems

Cons

Classic data issues: Same old problems about noisy, incomplete,

unreliable, and heterogeneous data

Difficult to steer: Significant cost/time in formulating, testing, refining, and

updating the queries

Too low-level: More of a data reduction solution than an analytics solution

One-direction pipeline for a human consumer

No implicit support for human background knowledge

Complex summarization tasks can exceed the window size

Historical data often has to be handled separately

Page 6: AIM NIAC PNNL-SA-116502

PNNL’s Analytics in Motion Initiative

AIM: A 5-year Lab-wide effort to advance the state of the art in

Interactive Streaming Analytics at Scale

Interactive

Humans in/on the Loop, actively steering and using their knowledge

New interaction techniques beyond the ticker tape

Address steering, cost of analytic algorithms, higher level information, agility

Analytics

From the “what” to the “why”: sensemaking, decision support, causality

Statistical normalization/summarization/outlier techniques

AI methods that dynamically incorporate human background knowledge

~10 Coordinated R&D projects per year, with university and

commercial partners (and always looking for more)!

Experimental cloud-based testbed

Increasingly-difficult set of use cases to contextualize the R&D

Distinguished Advisory Board

Page 7: AIM NIAC PNNL-SA-116502

How do we rebalance effort between humans and machines?

How can we automate the hypothesis generation and testing process?

How do we capture human insight in situ from streaming data sources?

Can we steer measurement systems automatically based on emerging knowledge?

Page 8: AIM NIAC PNNL-SA-116502

AIM Overview

AIM is developing new techniques for interactive streaming analytics,

tracking a stream in real-time and using human input to guide

computational models

Multiple classifier systems, with diverse model types (e.g., symbolic and PGMs)

Use high-level dynamic user feedback to steer the data production system, provide

model tunings/weightings/rankings, and fuse results
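
One way to picture user feedback steering a multiple classifier system is a weighted vote whose weights shrink for models the user marks as wrong. The sketch below follows the general weighted-majority idea and is only an illustration: the model names, the labels, the `rate` parameter, and the functions `fuse` and `apply_feedback` are all hypothetical and do not describe AIM's actual fusion method.

```python
from collections import defaultdict


def fuse(predictions, weights):
    """Weighted vote over model outputs; predictions maps model name -> label."""
    scores = defaultdict(float)
    for model, label in predictions.items():
        scores[label] += weights.get(model, 1.0)
    return max(scores, key=scores.get)


def apply_feedback(weights, predictions, correct_label, rate=0.4):
    """Multiply the weight of every model that disagreed with the user by rate."""
    return {
        model: w * (1.0 if predictions.get(model) == correct_label else rate)
        for model, w in weights.items()
    }


# Three hypothetical models disagree about one stream element.
weights = {"symbolic": 1.0, "pgm": 1.0, "baseline": 1.0}
predictions = {"symbolic": "attack", "pgm": "benign", "baseline": "benign"}

before = fuse(predictions, weights)  # majority label wins with equal weights
weights = apply_feedback(weights, predictions, correct_label="attack")
after = fuse(predictions, weights)   # down-weighted models no longer dominate
```

A multiplicative update is used here because it needs no access to past data, only the current predictions and one piece of feedback, which suits a setting where the stream itself is not retained.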

Key features of AIM’s streaming model

Data is forgotten: Each model’s cache is small relative to the data volume

Single-pass: No access to the data stream beyond the sample

Cooperative user: Important problem knowledge isn’t in training data

Not the whole system: Lambda embedding
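
The "data is forgotten" and "single-pass" constraints can be illustrated with reservoir sampling (Algorithm R), a standard technique for keeping a fixed-size uniform sample of a stream of unknown length. This is offered as a generic example of a small cache under a single pass; the talk does not say which sampling methods the AIM projects use, and the stream and sample size below are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a one-pass stream."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# The stream is consumed once; only k items are ever held in memory.
sample = reservoir_sample(iter(range(100_000)), k=10, rng=random.Random(0))
```

Memory use depends on `k` only, not on the length of the stream, so the cache stays small relative to the data volume.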

Page 9: AIM NIAC PNNL-SA-116502

AIM Programmatic Approach

Four AIM program focus areas (% of FY16 budget)

Streaming Data Characterization and Processing (20% of R&D)

Hypothesis Generation and Testing (30% of R&D)

Human-Machine Feedback (50% of R&D)

Infrastructure and Testing Environment (25% of operations)

Page 10: AIM NIAC PNNL-SA-116502

AIM FY16 Project Layout

Streaming Data Characterization

SFE: Scalable Feature Extraction and Sampling

CA: Compressive Analysis

Hypothesis Generation and Test

SDC: Streaming Data Characterization

NOUS: Streaming Knowledge Graphs

TeMpSA: Temporal Modeling in Streaming

Analytics

SAFE: Stream Adapted Foraging for Evidence

Human-Machine Feedback

UCHD: User-Centered Hypothesis Definition

TECSSD: Towards Enabling Complex Sensemaking

from Streaming Data

Transpire: Transparent Model-Driven Discovery of

Streaming Patterns

CD: Cognitive Depletion

Infrastructure and Test

AIM Software Infrastructure

[Figure: FY16 project layout diagram showing SFE, CA, SDC, NOUS, TeMpSA, UCHD, TECSSD, SAFE, Transpire, and CD; asterisks mark projects with an FY15 start]

Page 11: AIM NIAC PNNL-SA-116502

AIM Projects by Year and Technical Focus

SoI: Science of Interaction

OPA: Online Predictive Analytics

SHyRe: Scalable Hypothesis Reasoning

SFE: Scalable Feature Extraction and Sampling

CA: Compression Analysis

CD: Cognitive Depletion

PoP: Population-based Model Selection

NOUS

UCHD: User-Centered Hypothesis Definition

ASI: AIM Software Infrastructure

TeMpSA: Temporal Modeling in Streaming Analytics

SAFE: Stream Adaptive Foraging for Evidence

TECSSD: Toward Enabling Complex Sensemaking from Streaming Data

Transpire: Transparent Model-Driven Discovery

SDC: Streaming Data Characterization

[Figure: AIM projects arranged by year and by technical focus (Symbolic Reasoning, Statistical Data Mining, Human Computer Interaction)]

Page 12: AIM NIAC PNNL-SA-116502

How Will AIM Measure Streaming Capability?

Insight is a tradeoff between accuracy, throughput, and utility

Accuracy: AIM systems will converge to correct interpretations, under

two gold standards -- compared to the known state of the world as

reflected in the data, and compared to reference static analytic algorithms

running over the total data

Hypothesis: F1 (precision/recall) measures will be greater than with

algorithms alone or humans alone

Utility: AIM systems will provide stream interpretations that usefully

support insight in users, based on their needs, tasks, roles, and interests

Hypotheses: Users will be able to usefully guide streaming

classifiers; correct human interpretations will occur earlier in the

stream with AIM

Throughput: AIM systems will ingest streams and yield judgments at

rates sufficient for the problem domain

Hypothesis: AIM will achieve insight at a rate that exceeds current

technology-aided human baseline
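
For reference, the F1 measure named in the accuracy hypothesis is the harmonic mean of precision and recall. The sketch below computes all three for a set of flagged items against a gold standard; the function name and the example event identifiers are illustrative assumptions, not AIM evaluation data.

```python
def precision_recall_f1(predicted, actual):
    """Score a set of predicted positives against a gold-standard set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and the events a system flagged.
actual = {"e1", "e2", "e3", "e4"}
flagged = {"e1", "e2", "e5"}
precision, recall, f1 = precision_recall_f1(flagged, actual)
```

Under the slide's hypothesis, the same function would be applied to three sets of judgments (algorithms alone, humans alone, and the combined system) over the same gold standard, and the combined F1 would be expected to be the highest.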

Page 13: AIM NIAC PNNL-SA-116502

AIM Use Cases

NMR and Metabolomics

Goal: More rapid metabolite identification in a bioreactor

User: Operator provides important background knowledge

Stream comes from NMR machine as spectral data

Processing algorithms propose specific metabolites and track

concentration changes

Strategic Surprise (OODA Loop Modeling)

Goal: Detect Line of Business (LOB) change in export data

User: Domain expert in company

Stream PIERS subset at high rates, produce hypotheses about LOB

changes

Cloud Cyber

Goal: Streaming telemetry data from PNNL IRC and other sources for

detection of cyber exploits; LAS partnership

User: Real-time cyber defenders

Electron Microscopy

Goal: Detection of events and anomalies in real-time EM imagery

User: Microscope operator

[Side labels: FY15 (NMR and Metabolomics, Strategic Surprise); FY16 (Cloud Cyber, Electron Microscopy)]

Page 14: AIM NIAC PNNL-SA-116502

AIM FY15 NMR Use Case

Use case goals

Build on FY14’s fast streaming compound ID from partial spectra

Prepare for a FY16 Microscopy Use Case

Evaluate imagery on-the-fly

Evaluate and communicate residual signal to user with potential solutions

Scientist interaction and intervention within the experiment

Participating FY15 AIM projects

CA: Compression Analysis

SFE: Scalable Feature Extraction

OPA: Online Predictive Analytics

SoI: Science of Interaction

ASI: AIM Software Infrastructure

Page 15: AIM NIAC PNNL-SA-116502

AIM FY15 Strategic Surprise Use Case

Use case goals

Model an OODA loop with streaming data

Real (dirty) data

Employ user feedback and rationale

Provide “hello world” for projects

Straightforward to modify stream rate or data for different experimental scenarios

“Frankencompanies”

Participating FY15 AIM projects

SFE: Scalable Feature Extraction

NOUS

PoP: Population-based Model Selection

OPA: Online Predictive Analytics

SHyRe: Scalable Hypothesis Reasoning

SoI: Science of Interaction

ASI: AIM Software Infrastructure

UCHD: User-Centered Hypothesis Definition

Page 16: AIM NIAC PNNL-SA-116502

AIM FY16 Use Case: Cyber Defense

Use case parameters

Input data from cloud telemetry (Digital Signatures), other sensors and data sources

Sampling and processing algorithms specialized to cyber data

Hypotheses are indicators of attack or nonstandard system behavior

Strong cyber defender involvement to provide cyber knowledge and cyber defense

tradeoffs

We are working intensively with OGAs on streaming cyber security use cases for active mitigation

Both data stream processing and cyber defender-in-the-loop

[Figure: pipeline labels: Fast cyber data; Detect and ID threats; Possible attacks with evidence; Support cyber defender tradeoffs; Cyber defense actions]

Page 17: AIM NIAC PNNL-SA-116502

AIM Infrastructure and Testing Environment

Key objectives

Provide individual AIM algorithms with common initiative data sets

Characterize integrated AIM algorithm performance and tradeoffs

Measure overall accuracy, speed, and throughput

Status

Landscape analysis settled on Kafka and LIFT/Avro

650K msgs/sec single-threaded streaming

Built on PNNL’s new Institutional Research Cloud (IRC)

Support possible USG transition and packages like Spark

Fault tolerance, topic partitioning, load balancing

Support of AIM use cases

Software assistance to individual AIM projects

Confluent coordination
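
The topic partitioning mentioned in the status list can be sketched in plain Python: messages with the same key are routed to the same partition, so per-key order is preserved while load spreads across partitions. This is a simplified stand-in and not the AIM infrastructure code. Kafka's default partitioner hashes keys with murmur2; `zlib.crc32` is swapped in here only to keep the example dependency-free, and the function names and sensor keys are hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (crc32 as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


def assign(messages, num_partitions):
    """Route (key, value) pairs to partitions, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append(value)
    return partitions


# Hypothetical keyed readings from two sensors.
readings = [(b"sensor-a", 1), (b"sensor-b", 2), (b"sensor-a", 3)]
partitions = assign(readings, num_partitions=3)
```

Because the mapping depends only on the key and the partition count, any consumer can work out which partition holds a given key's messages without coordination.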

PNNL Living Laboratory

Policy and process for full lifecycle management of our own

internal data when used in research

Page 18: AIM NIAC PNNL-SA-116502

Mark Greaves

Technical Director for Analytics

NATIONAL SECURITY DIRECTORATE

Phone: (206) 528-3300

Mobile: (206) 972-2201

[email protected]

www.pnnl.gov

aim.pnnl.gov
