© 2014 IBM Corporation The Power of Declarative Analytics August 30, 2014 Acknowledgement: Shiv, Sekar, Fred, Laura, Berthold, and many more to list here. Yunyao Li IBM Almaden Research Center

The Power of Declarative Analytics

Dec 05, 2014


Yunyao Li

Invited Talk at Modern Data Management Systems Summit on August 29-30, 2014 at Tsinghua University in Beijing, China.
http://ise.thss.tsinghua.edu.cn/MDMS/English/program.jsp

Abstract:
Modern enterprises increasingly rely on complex analyses of large data sets to drive business decisions. Tasks such as root cause analysis from system logs, lead generation based on social media, customer retention and digital marketing are rapidly gaining importance. These applications generally consist of three major analytic phases: text analytics, semi-structured data processing (joins, group-by, aggregation), and statistical/predictive modeling. The size of the datasets, in conjunction with the complexity of the analysis, necessitates large-scale distributed processing of the analytical algorithms. At IBM we are building tools and technologies based on declarative languages to support each of these analytic phases. The declarative nature of these languages abstracts away the need for hand optimization by the programmer. Furthermore, the syntax of each language is designed to appeal to the corresponding community. For example, for statistical modeling we expose a high-level language with syntax similar to R, a very popular statistical processing language.
In this talk I will give an overview of some real-world big data applications we are currently working on and use that to motivate the need for declarative analytics consisting of the three major phases discussed above. I will then describe, in some detail, declarative systems for text analytics along with a discussion on speeds, feeds and comparisons.
Transcript
Page 1: The Power of Declarative Analytics

© 2014 IBM Corporation

The Power of Declarative Analytics

August 30, 2014

Acknowledgement: Shiv, Sekar, Fred, Laura, Berthold, and too many more to list here.

Yunyao Li

IBM Almaden Research Center

Page 2: The Power of Declarative Analytics


Unlocking the value from big data

Page 3: The Power of Declarative Analytics

Case Study: Sentiment Analysis

Data sources: Social Media; Product catalog, Customer Master Data, …
Text Analytics
360° Profile:
• Products, Interests
• Relationships
• Personal Attributes
• Life Events
Statistical Analysis, Report Gen.
Customer 360°
[Pie chart: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%]
• Who can we cross/up sell?
• What are our customers thinking of our brand?
• What do our customers want?

Page 4: The Power of Declarative Analytics

Drug Development Pipeline

Phases: 0-I, II, III, IV
Time to develop a drug: 12-15 years
Avg cost to develop a drug: $1.2 billion
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile

Laboratory: 50,000+ compounds
Pre-Clinical: 250 compounds
Clinical: 5 compounds
FDA approval: 1 compound

“... Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure”
Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome)

Page 5: The Power of Declarative Analytics

Case Study: Drug Discovery

Structured and unstructured data sources
• Structure
• Indication
• Contraindication
• Mode of action / target
• Effect level
• Side Effects
Data-driven decision making: more efficient clinical trial design, data analytics and drug success / failure predictions

Page 6: The Power of Declarative Analytics

Case Study: Water Cost Index

• Financial reports
• News feeds
• Websites
• …

Financial Analytics: What is the cost of water in different regions?

Who cares about the cost of water?
• Water agencies: Improve credit profile for water infrastructure projects
• Lenders: Better estimate cost and profits of such projects
• Insurers: Better understand underlying risk of such projects
• Consumers: Access water at an affordable price despite increasing population and demand for water

• Provides market benchmark
• Spurs growth of financial products for both water producers and investors

Page 7: The Power of Declarative Analytics

Case Study: Water Cost Index

• Financial reports
• News feeds
• Websites
• …

Text Analytics, Financial Analytics, Statistical Analysis
Water Cost Index (WCI)
• Uganda signed up as 1st customer for WCI
• WCI published on ongoing basis starting end of Sep. 2013
• Wall Street Journal article on WCI

Page 8: The Power of Declarative Analytics

What is this talk about?

• What makes analytics tasks difficult and what can be learnt from the success of relational systems
• Brief description of declarative systems being built at IBM for
  √ Information Extraction (SystemT)
  √ Machine Learning (SystemML)
  X Entity Resolution (DeeR)

[Diagram labels: Databases; Semi-/Unstructured Documents; Information Extraction; Data Integration; Statistical Analysis / Machine Learning]

Page 9: The Power of Declarative Analytics

Challenges in Information Extraction

Example: Named Entity Recognition
Input text: … Laura Haas works for IBM in San Jose, CA. …
Information Extraction output:
Person        Org    Loc
Laura Haas    IBM    San Jose, CA

Named Entity Hierarchy: 200+ types: Person, Organization, Location, …

Page 10: The Power of Declarative Analytics

Challenges in Information Extraction

Breadth: Wide varieties of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
• Collecting dictionaries
• Writing regular expressions
• Collecting other word-level features
Labeling + training/tuning machine learning models
or
Writing + testing rules
IE development takes effort!!

Page 11: The Power of Declarative Analytics

Challenges in Information Extraction

Breadth: Wide varieties of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
IE development takes effort!!

Complexity: In development & customization
… Pres. Barack Obama arrived today at the White House …
Entity Boundary: Person or Position + Person?
Entity Definition: Location/Facility/Organization?
Domain customization is usually required!!

Page 12: The Power of Declarative Analytics

Challenges in Information Extraction

Breadth: Wide varieties of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
IE development takes effort!!

Complexity: In development & customization
… Pres. Barack Obama arrived today at the White House …
Entity Boundary: Person or Position + Person?
Entity Definition: Location/Facility/Organization?
Domain customization is usually required!!

Scale: 450M+ tweets per day, …

State-of-the-art Open-Source Rule-based System
• 80,000+ dictionary entries
• 4,800 lines of JAPE and Java code
• Accuracy (English): 50%-80%
• Performance: 20 KB/sec, 8 GB RAM

State-of-the-art Machine-learning system
• Combination of 4 classifiers
• 150,000+ dictionary entries
• 15+ regexes for word features
• Accuracy: 89%
• Throughput: ~10 KB/sec

Page 13: The Power of Declarative Analytics

Challenges in Scalable Machine Learning

[Figure: V ≈ W x H, with V as documents x words, W as documents x topics, H as topics x words]

[Liu, WWW 2010]: ~1500 lines of Java code
• Billions of non-zeros within tens of hours
• Careful partitioning of data
• Maximize data locality and parallelism

% initialize W, H
while (~converged)
  W = W*(V%*%t(H))/(W%*%H%*%t(H))
  H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
end

Parallel implementation is half the story! Typical application requires experimenting with multiple variants:

Regularizers JW, JH
  W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
  H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)

Weighted Sq Loss / Matrix Completion Setting
  W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
  H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))

Different loss function: KL-divergence
  W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H))
  H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)

Breadth: Wide varieties of ML models
Complexity: In implementation
Scale: 450M+ tweets per day, …
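The basic multiplicative update loop on this slide can be sketched on a single machine in a few lines of plain Python. This is only an illustrative sketch, not the implementation discussed in the talk: the helper names, the random initialization, the fixed iteration count (instead of a convergence test) and the small `eps` guard against division by zero are all assumptions added here.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def multiplicative_update(X, numerator, denominator, eps=1e-12):
    """Cell-wise X * numerator / denominator, the shape shared by both update rules."""
    return [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
            for rx, rn, rd in zip(X, numerator, denominator)]

def squared_loss(V, W, H):
    WH = matmul(W, H)
    return sum((v - u) ** 2 for rv, ru in zip(V, WH) for v, u in zip(rv, ru))

def gnmf(V, topics, iterations=100, seed=0):
    """Factor a non-negative matrix V into W (rows x topics) and H (topics x cols)."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(topics)] for _ in V]
    H = [[rng.uniform(0.1, 1.0) for _ in V[0]] for _ in range(topics)]
    losses = [squared_loss(V, W, H)]
    for _ in range(iterations):
        Ht = transpose(H)
        # W = W*(V%*%t(H))/(W%*%H%*%t(H))
        W = multiplicative_update(W, matmul(V, Ht), matmul(matmul(W, H), Ht))
        Wt = transpose(W)
        # H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
        H = multiplicative_update(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
        losses.append(squared_loss(V, W, H))
    return W, H, losses

# Tiny illustrative "documents x words" matrix (made-up values).
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [1.0, 1.0, 1.0]]
W, H, losses = gnmf(V, topics=2)
```

Switching to one of the variants on the slide only changes the two numerator and denominator expressions, which is the point the slide makes about experimenting with multiple variants.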

Page 14: The Power of Declarative Analytics

What is common across these analytics tasks?

• Variety of problems and solutions
  – Every customer’s data & problems are unique in some way
  – Need to quickly implement new business logic
  – Need to experiment with multiple algorithms for a particular analytic problem
• Quality of answers is very important!!
  – High quality analytics requires “complex” programs
  – Skilled developers + domain experts
• Performance is critical
  – Bigger data demands faster execution
    • Social Media: Twitter alone has 400M+ messages / day; 1 TB+ per day
    • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
    • Machine Data: One application server under moderate load at medium logging level: 1 GB of logs per day

Breadth, Complexity, Scale

Page 15: The Power of Declarative Analytics

Declarative Systems: The Relational World

Task: Compute average salary for each department

SQL Query:
select D.did, avg(E.salary)
from Employee E, Department D
where E.did = D.did
group by D.did

[Diagram labels: Query Optimizer; Execution Strategy; Tables, Indices]

Declarative High-level Language: User specifies tasks in a high-level language, without specifying algorithms for data processing
Query Optimization: System uses optimization strategies to choose from alternate execution plans
Physical Data Independence: User does not have to worry about physical data representation and access aids while writing queries; system manages the physical layer
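To make the three properties concrete, the slide's query can be run as-is against a throwaway SQLite database. The sketch below is an assumption-laden illustration: the table columns other than `did` and `salary`, the index name and the sample rows are invented here, and SQLite merely stands in for any relational engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Department (did integer primary key, name text);
create table Employee (eid integer primary key, did integer, salary real);
insert into Department values (1, 'Research'), (2, 'Sales');
insert into Employee values (10, 1, 100.0), (11, 1, 140.0), (12, 2, 90.0);
-- Physical layer detail: the query text below never mentions this index.
create index emp_did on Employee(did);
""")

# The query from the slide, unchanged: it states what to compute, not how.
QUERY = """
select D.did, avg(E.salary)
from Employee E, Department D
where E.did = D.did
group by D.did
"""

# "order by" is added only so the illustrative output has a fixed order.
rows = conn.execute(QUERY + " order by D.did").fetchall()

# The engine, not the user, decides on join order and index usage.
plan = conn.execute("explain query plan " + QUERY).fetchall()

print(rows)
for step in plan:
    print(step)
```

Dropping or adding the index changes the plan the engine reports but not the query text or its result, which is what physical data independence means in practice.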

Page 16: The Power of Declarative Analytics

Why did Relational Systems succeed?

Pat Selinger (SIGMOD Record, December 2003):
Boeing said “We can ask questions we could never find the answers to before. We’re now able to do more than we could ever do before.”

Bruce Lindsay (SIGMOD Record, June 2005):
The invention of nonprocedural specification was a tremendous simplification that made it much easier to specify applications. No longer did you have to say which index to use and which join method to use to get the job done.

Michael Stonebraker (“What Goes Around Comes Around”, Readings in Database Systems, 4th Edition, 2005):
Query optimizers can beat all but the best DBMS application programmers.

Page 17: The Power of Declarative Analytics

What is common across these analytics tasks?

• Variety of problems and solutions
  – Every customer’s data & problems are unique in some way
  – Need to quickly implement new business logic
  – Need to experiment with multiple algorithms for a particular analytic problem
• Quality of answers is very important!!
  – High quality analytics requires “complex” programs
  – Skilled developers + domain experts
• Performance is critical
  – Bigger data demands faster execution

Breadth, Complexity, Scale

Page 18: The Power of Declarative Analytics

What am I going to talk about?

• What makes analytics tasks difficult and what can be learnt from the success of relational systems
• Brief description of declarative systems built at IBM and the design choices made along the way

Analytics systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design choices: Data Model, Operations, Language Syntax, Platform

Page 19: The Power of Declarative Analytics

Information Extraction - SystemT

Page 20: The Power of Declarative Analytics

Informal Music Band Reviews from Blogs

Example review text:
  I went … to the OTIS concert last night
  They played … “I Will Survive”…
  a bunch of other bands also playing
  The sax player in that band…

[Diagram labels: Concert Mention Pattern; Music Review Snippet (three occurrences); Review within 200 tokens]

• Start with Concert Mention
• Consecutive Review Snippets are within 25 tokens
• At least 3 occurrences of Music Review Snippet or Generic Review Snippet
• Review ends with one of these.
• Complete review is within 200 tokens

Page 21: The Power of Declarative Analytics

State-of-the-art: Common Pattern Specification Language (CPSL)

A common language to specify and represent extraction rules as cascading grammars
Developed jointly between SRI and Department of Defense (1999)

Level 0 (input text):
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. … risus in sagittis facilisis Jon Foreman their lead vocal/guitarist hendrerit faucibus pede mi ipsum. …

Level 1 rules:
  ⟨Token⟩[~ “pipe | guitarist | …”] → ⟨Instrument⟩
  ⟨Token⟩[~ “([A-Z]\w+)\s+[A-Z]\w+”] → ⟨BandMember⟩
Level 1 (annotated text):
  … risus in sagittis facilisis <BandMember> their lead vocal/<Instrument> hendrerit faucibus pede mi ipsum. …

Example Rule: Band Member name followed within 5 tokens by Instrument clue is a Music Review Snippet

Level 2 rule:
  ⟨BandMember⟩ ⟨Token⟩{0,5} ⟨Instrument⟩ → ⟨MusicReviewSnippet⟩
Level 2 (annotated text):
  … lorem velit, sed <ReviewSnippet>, hendrerit faucibus pede mi sed ipsum. …

Page 22: The Power of Declarative Analytics

Why is this not sufficient?

• Start with Concert Mention
• Consecutive Review Snippets are within 25 tokens
• At least 3 occurrences of Music Review Snippet or Generic Review Snippet
• Review ends with one of these.
• Complete review is within 200 tokens

• Counting and aggregations are not natural primitives in grammar and have to be handled in custom code [Chiticariu, ACL 2010]
• Finely tuned grammar-based extraction system, with custom code for counting and aggregation, took ~6 hours to extract reviews from a million web logs

Page 23: The Power of Declarative Analytics

SystemT - Declarative Approach to Information Extraction

AQL: Rule language with familiar SQL-like syntax. Specify annotator semantics declaratively
SystemT Optimizer: Choose an efficient execution plan that implements the semantics
Compiled Operator Graph
SystemT Runtime: Highly scalable, embeddable Java runtime
Input Document Stream in, Annotated Document Stream out

See SIGMOD 2010 tutorial [Chiticariu et al., 2010] for details on other recent declarative IE systems

Page 24: The Power of Declarative Analytics

Expressing Music Review Snippet Rule in AQL

Rule: <BandMember> followed within 0-5 tokens by <Instrument>

create view BandMember as
  extract regex /[A-Z]\w+\s+[A-Z]\w+/ on D.text as name
  from Document D;

create view MusicReviewSnippet as
  select B.name as member, I.value as instrument,
         CombineSpans(B.name, I.value) as review
  from BandMember B, Instrument I
  where FollowsTok(B.name, I.value, 0, 5);

Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
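For readers who do not know AQL, the intended semantics of the rule can be approximated in plain Python. This is a rough sketch and not how SystemT evaluates the view: the tokenizer, the contents of the instrument dictionary and all function names are assumptions made here, and spans are simple (begin, end) character offsets.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")               # assumed tokenizer: words and punctuation
BAND_MEMBER = re.compile(r"[A-Z]\w+\s+[A-Z]\w+")  # the regex from the BandMember view
INSTRUMENTS = {"guitarist", "pipe", "drummer"}    # hypothetical dictionary entries

def tokenize(text):
    return [(m.start(), m.end()) for m in TOKEN.finditer(text)]

def follows_tok(first, second, tokens, lo, hi):
    """True when `second` starts after `first` ends with lo..hi whole tokens in between."""
    if second[0] < first[1]:
        return False
    gap = sum(1 for (b, e) in tokens if b >= first[1] and e <= second[0])
    return lo <= gap <= hi

def music_review_snippets(text):
    tokens = tokenize(text)
    members = [(m.start(), m.end()) for m in BAND_MEMBER.finditer(text)]
    instruments = [(b, e) for (b, e) in tokens if text[b:e].lower() in INSTRUMENTS]
    results = []
    for b in members:                      # join of BandMember and Instrument
        for i in instruments:
            if follows_tok(b, i, tokens, 0, 5):
                results.append({
                    "member": text[b[0]:b[1]],
                    "instrument": text[i[0]:i[1]],
                    "review": text[b[0]:i[1]],  # CombineSpans: smallest span covering both
                })
    return results

sample = "risus in sagittis facilisis Jon Foreman their lead vocal/guitarist hendrerit faucibus"
print(music_review_snippets(sample))
```

The nested loop is the naive way to evaluate the join; the declarative form leaves the system free to pick a cheaper strategy.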

Page 25: The Power of Declarative Analytics

What makes AQL expressive?

• Extraction primitives
  – Regular Expressions
  – Dictionary
• Text-specific primitives
  – Multi-lingual tokenization and parts-of-speech
  – Sentence and paragraph boundary detection
  – Span-based predicates
• Set-level primitives
  – Join
  – Block
  – Consolidation
  – Group By

Page 26: The Power of Declarative Analytics

How will the Music Band Review extractor work in SystemT?

Operator graph:
• Inputs: Music Review Snippet, Generic Review Snippet, Concert Mention
• Union: produces Review Snippet
• Block: Find blocks of three or more “Review Snippet” patterns, producing Blocks of Review Snippet
• Join: Join predicates enforce additional constraints
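The Union, Block and Join steps can be mimicked with a short Python sketch. It simplifies heavily and is not SystemT's Block operator: each annotation is reduced to a single token position (an assumption), only maximal blocks are returned, and the join keeps a block when it starts at or after a concert mention and ends within 200 tokens of it.

```python
def blocks(positions, max_gap, min_size):
    """Group token positions into maximal runs whose neighbours are at most
    max_gap tokens apart, keeping only runs with at least min_size members."""
    runs, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                runs.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_size:
        runs.append(current)
    return runs

def reviews(concert_mentions, music_snippets, generic_snippets):
    snippets = music_snippets + generic_snippets            # Union
    found = []
    for block in blocks(snippets, max_gap=25, min_size=3):  # Block
        for start in concert_mentions:                      # Join with Concert Mention
            if start <= block[0] and block[-1] - start <= 200:
                found.append((start, block[-1]))
    return found

# Made-up token positions: one concert mention, three nearby snippets, one far away.
print(reviews(concert_mentions=[5], music_snippets=[20, 40], generic_snippets=[60, 300]))
```

Counting ("at least 3") and distance constraints are ordinary parameters here, whereas the previous slides note they need custom code in a cascading grammar.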

Page 27: The Power of Declarative Analytics

Class of Optimizations in SystemT

• Rewrite-based: rewrite algebraic operator graph
  – Shared Dictionary Matching
  – Shared Regular Expression Evaluation
  – On-demand tokenization
  Tokenization overhead is paid only once
• Cost-based: relies on novel selectivity estimation for text-specific operators
  – Standard transformations
    • E.g., push down selections
  – Restricted Span Evaluation
    • Evaluate expensive operators on restricted regions of the document

Example: BandMember followed within 5 tokens by Instrument
Plan A: Join BandMember and Instrument
Plan B (Restricted Span Evaluation): find BandMember, extract text to the right, identify Instrument starting within 5 tokens
Plan C (Restricted Span Evaluation): find Instrument, extract text to the left, identify BandMember ending within 5 tokens
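A toy comparison of the plain join plan with restricted span evaluation, counting dictionary lookups as a stand-in for cost. Everything here is an illustrative assumption: whitespace tokenization, a "two consecutive capitalized tokens" test in place of the BandMember regex, a hypothetical instrument dictionary, and lookups as the only cost that is counted.

```python
INSTRUMENTS = {"guitarist", "drummer", "saxophonist"}  # hypothetical dictionary

def is_name_token(tok):
    return len(tok) > 1 and tok[0].isupper()

def band_members(tokens):
    """Indices i where tokens[i] and tokens[i + 1] look like a two-word name."""
    return [i for i in range(len(tokens) - 1)
            if is_name_token(tokens[i]) and is_name_token(tokens[i + 1])]

def plan_a(tokens):
    """Run the dictionary over the whole document, then join with BandMember."""
    lookups = len(tokens)
    instruments = [j for j, t in enumerate(tokens) if t.lower() in INSTRUMENTS]
    pairs = [(i, j) for i in band_members(tokens) for j in instruments
             if 0 <= j - (i + 2) <= 5]
    return pairs, lookups

def plan_b(tokens):
    """Find BandMember first, then run the dictionary only on the text to its right."""
    pairs, lookups = [], 0
    for i in band_members(tokens):
        for j in range(i + 2, min(i + 8, len(tokens))):  # 0 to 5 tokens in between
            lookups += 1
            if tokens[j].lower() in INSTRUMENTS:
                pairs.append((i, j))
    return pairs, lookups

doc = ("lorem ipsum " * 20 + "Jon Foreman their lead vocal guitarist " + "dolor sit " * 20).split()
print(plan_a(doc))
print(plan_b(doc))
```

Both plans return the same matches; which one is cheaper depends on how selective each extractor is, which is why the slide describes the choice as cost-based.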

Page 28: The Power of Declarative Analytics

Performance benefits using SystemT [Chiticariu et al. ACL’10]

• Music Band Review extraction task over a million web logs
  – SystemT vs. the grammar implementation: 10 minutes vs. ~6 hours
• Named-entity extraction task over multiple document corpora
  – SystemT throughput ranges from 400 to 900 KB/sec/core (depending on the size of the document)
  – SystemT vs. State-of-the-Art Learning-based System [Florian et al, CoNLL’03]: ~50 times higher throughput
  – SystemT vs. State-of-the-Art Grammar-based System [ANNIE, Cunningham et al, ACL’02]: ~10-50 times higher throughput, ~60-90% less memory consumption
• Revisiting the Twitter example, for keeping up with today’s tweets with 18 cores
  – SystemT takes 30 minutes per day as opposed to running 24/7 for the state-of-the-art system

Page 29: The Power of Declarative Analytics

Runs fast! But is SystemT expressive enough to compare on quality? [Chiticariu et al. ACL’10, EMNLP’10]

• SystemT outperforms current best results on multiple benchmark datasets
  – CoNLL 2003
    • F-measure between 89% and 92% for Person, Organization and Location tasks
    • Beats the state-of-the-art results consistently by up to 4%
  – Enron Email
    • F-measure 85% for Person task
    • Better than the state-of-the-art result by 7%

Page 30: The Power of Declarative Analytics

What design choices did we make for SystemT?

Analytics systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design choices for SystemT:
Data Model: Document-at-a-time model; data types: Span, Tuple, Relation
Operations: Feature extraction primitives; text-specific primitives; set-level primitives
Language Syntax: SQL-like syntax
Platform: Embeddable runtime deployed in a wide range of execution environments

Page 31: The Power of Declarative Analytics

SystemML

Page 32: The Power of Declarative Analytics

Status Quo of Machine Learning Algorithms

• Machine Learning algorithm implementations today
  – Specialized languages for Machine Learning
    • R, Matlab
    • Execution strategy for programs is determined by user
  – Low-level implementations
    • Directly implement ML algorithms on specific platforms
  – Hand-tuned implementations on specialized hardware: GPU, BlueGene etc.
• But the programmer has to handle
  – Performance optimizations due to data and compute platform characteristics
  – Parallelization for specific platforms

Page 33: The Power of Declarative Analytics

SystemML Goals

GNMF: V ≈ U = W H

V=readMM("in/V", rows=1e8, cols=1e5);
W=readMM("in/W", rows=1e8, cols=10);
H=readMM("in/H", rows=10, cols=1e5);
max_iteration=20;
i=0;
while(i<max_iteration){
  H=H*(t(W)%*%V)/(t(W)%*%W%*%H);
  W=W*(V%*%t(H))/(W%*%H%*%t(H));
  i=i+1;}

[Diagram labels: Higher level Optimizations; Operator Implementations; MapReduce Platform (MR1, MR2, …, MRn)]

• Provide language to implement ML algorithms
• Support specific ML constructs such as cross validation, bootstrapping, ensembles as first class citizens
• Optimizations based on data and system characteristics
• Scalable operator implementations

Page 34: The Power of Declarative Analytics

SystemML Architecture

• DML: Declarative Machine Learning Language
  – Retain expressivity of current ML languages including procedural constructs like while and for loops
• High-Level Operator (HOP) Component
  – Represent dataflow in DAGs of matrices and scalar operations
  – Choose from alternative execution plans using algebraic rewrites and cost-based optimization
• Low-Level Operator (LOP) Component
  – Low-level physical execution plan over key-value pairs
  – “Piggyback” operations to reduce number of MapReduce jobs
• Runtime
  – Efficient data representation and implementation of individual operations in MapReduce framework
  – Control module to orchestrate MR jobs

Page 35: The Power of Declarative Analytics

Simple Example of how SystemML works

Statement: A = B * (C / D)

Language: Input DML parsed into statement blocks with typed variables
HOP Component: Binary hop Divide over C and D, then Binary hop Multiply with B. HOP represents the logical flow of the program as DAGs for each statement block. HOP operates on matrices and scalars
LOP Component: Group lop and Binary lop Divide, then Group lop and Binary lop Multiply. LOP represents the physical plan for the program with a DAG for each statement block. LOP operates on key-value pairs and scalars
Runtime: Multiple low-level operators combined in a MapReduce job (MR Job: M1, R1)
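The logical side of this example can be pictured as a tiny operator DAG. The sketch below is a hypothetical illustration of that idea and not SystemML's internal data structures: the `Hop` class, the operator names and the in-memory cell-wise evaluation are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Hop:
    """One node of a logical operator DAG (illustrative, not SystemML's class)."""
    op: str                          # "read", "multiply", or "divide"
    inputs: Tuple["Hop", ...] = ()
    name: str = ""                   # variable name for "read" nodes

def read(name):
    return Hop("read", (), name)

def binary(op, left, right):
    return Hop(op, (left, right))

CELL_OPS = {"multiply": lambda a, b: a * b, "divide": lambda a, b: a / b}

def evaluate(hop, matrices):
    """Evaluate the DAG cell by cell on small in-memory matrices."""
    if hop.op == "read":
        return matrices[hop.name]
    left, right = (evaluate(h, matrices) for h in hop.inputs)
    f = CELL_OPS[hop.op]
    return [[f(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(left, right)]

def operator_order(hop):
    """Post-order list of the non-read operators, i.e. a valid execution order."""
    if hop.op == "read":
        return []
    order = []
    for h in hop.inputs:
        order.extend(operator_order(h))
    return order + [hop.op]

# A = B * (C / D)
B, C, D = read("B"), read("C"), read("D")
A = binary("multiply", B, binary("divide", C, D))

result = evaluate(A, {"B": [[2.0, 4.0]], "C": [[6.0, 8.0]], "D": [[3.0, 2.0]]})
print(operator_order(A), result)
```

Because both operators are cell-wise, a physical plan can run them in the same pass over the data, which is the intuition behind combining multiple low-level operators in one MapReduce job.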

Page 36: The Power of Declarative Analytics

Declarative Machine Learning Language

• Syntax borrowed from R
• What is supported
  – Data Types: matrix, vector, scalar
  – Statements
    • Input/Output, Assignment, Control Structures (while, for), Rand
  – Expressions
    • Operators: Arithmetic, Comparative, Boolean, Matrix Multiplication
    • Built-in Functions: Linear Algebra (transpose, …), Matrix aggregation (colSum, ...), Mathematical (ln, sqrt, …)
  – External Functions
  – Machine Learning specific constructs: Cross validation, Ensemble learning

Page 37: The Power of Declarative Analytics

Categories of Optimization in SystemML

• HOP component
  – Algebraic rewrites (e.g., matrix computation reordering)
  – Cost-based optimization (e.g., choosing between different plans for matrix multiplication)
  – Selection of physical representation of matrices (e.g., cell versus block representation)
• LOP component
  – Piggybacking (packing lops that can be evaluated together in a single MapReduce job)
• Runtime
  – Data representation (e.g., sparse versus dense)
  – Sparsity-aware operator implementations
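A small worked example shows why matrix computation reordering matters. It counts scalar multiplications for the two ways of evaluating t(W) %*% W %*% H, using the GNMF dimensions from the SystemML Goals slide (W is 1e8 x 10, H is 10 x 1e5). The dense cost model and the function names are assumptions for illustration; a real optimizer would also account for sparsity and data movement.

```python
def multiply_cost(left_shape, right_shape):
    """Scalar multiplications for a dense product, plus the result shape."""
    rows, inner = left_shape
    inner2, cols = right_shape
    if inner != inner2:
        raise ValueError("inner dimensions must agree")
    return rows * inner * cols, (rows, cols)

def chain_cost(shapes, left_first):
    """Cost of a three-matrix product evaluated as (A B) C or A (B C)."""
    a, b, c = shapes
    if left_first:
        first, ab = multiply_cost(a, b)
        second, _ = multiply_cost(ab, c)
    else:
        first, bc = multiply_cost(b, c)
        second, _ = multiply_cost(a, bc)
    return first + second

W_T, W, H = (10, 10**8), (10**8, 10), (10, 10**5)

left = chain_cost((W_T, W, H), left_first=True)    # (t(W) %*% W) %*% H
right = chain_cost((W_T, W, H), left_first=False)  # t(W) %*% (W %*% H)

print(left, right, right // left)
```

Under this model the first ordering needs about 1e10 multiplications and a 10 x 10 intermediate, while the second needs about 2e14 and a 1e8 x 1e5 intermediate, so the choice of ordering dominates the cost of the update.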

Page 38: The Power of Declarative Analytics

Performance Numbers

Gaussian NMF:
V = readMM ("example.GNMF.V", rows= 1000, cols=100, nnzs= 2000, format="text");
W = readMM ("example.GNMF.W", rows= 1000, cols=20, nnzs= 20000, format="text");
H = readMM ("example.GNMF.H", rows= 20, cols=100, nnzs= 2000, format="text");
max_iteration = 10
i = 0
while (i < max_iteration) {
  H = H * ((t(W) %*% V) / ( (t(W) %*% W) %*% H))
  W = W * ((V %*% t(H)) / ( W %*% (H %*% t(H))))
  i = i + 1
}
writeMM (W,"example.GNMF.W.result", format="text");
writeMM (H,"example.GNMF.H.result", format="text");

In SystemML
  Data Size: 5 billion non-zeros (50M x 100K, sparsity 1x10^-3)
  Time per iteration: 1.2 hours
  Lines of Code: 11 lines of DML code
  Runtime Platform: 40 cores, 4 GB RAM per core

WWW 2010
  Data Size: 4.38 billion non-zeros (43.9M x 768M, sparsity 1.3x10^-7)
  Time per iteration: 7 hours
  Lines of Code: 1500 lines of Java code
  Runtime Platform: SCOPE cluster

Page 39: The Power of Declarative Analytics

Additional Algorithms

[Plot: DML PageRank. Execution Time (sec), axis 0 to 400, vs. #rows and #columns in G (thousand), axis 0 to 1600]
[Plot: DML Linear Regression. Execution Time (sec), axis 0 to 800, vs. #rows in V (million), axis 0 to 20]

PageRank script:
G=readMM("in/G", rows=1e6, cols=1e6);
p=readMM("in/p", rows=1e6, cols=1);
e=readMM("in/e", rows=1e6, cols=1);
ut=readMM("in/ut", rows=1, cols=1e6);
alpha=0.85;
max_iteration=20;
i=0;
while(i<max_iteration){
  p=alpha*(G%*%p)+(1-alpha)*(e%*%ut%*%p);
  i=i+1;}

writeMM(p, "out/p");
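The same power iteration can be checked on a toy graph in plain Python. This is a single-machine sketch for intuition only: the three-node graph, the choice of `e` as a column of ones and `ut` as a row of 1/n, and the uniform starting vector are assumptions, since the slide reads these inputs from files.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pagerank(G, alpha=0.85, max_iteration=20):
    """p = alpha*(G %*% p) + (1-alpha)*(e %*% ut %*% p), repeated max_iteration times."""
    n = len(G)
    p = [1.0 / n] * n      # assumed uniform start
    e = [1.0] * n          # assumed: column vector of ones
    ut = [1.0 / n] * n     # assumed: row vector of 1/n
    for _ in range(max_iteration):
        teleport = sum(u * x for u, x in zip(ut, p))  # ut %*% p is a scalar
        Gp = matvec(G, p)
        p = [alpha * g + (1 - alpha) * ev * teleport for g, ev in zip(Gp, e)]
    return p

# Column-stochastic link matrix for a made-up graph: 0->1, 0->2, 1->2, 2->0.
G = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
p = pagerank(G)
print(p)
```

With a column-stochastic G and these choices of `e` and `ut`, the entries of p keep summing to 1 across iterations, which is a quick sanity check on the update rule.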

Sparse Linear Regression script:
V=readMM("in/V", rows=1e8, cols=1e5);
b=readMM("in/b", rows=1e8, cols=1);
lambda = 1e-6;
# w (1e5 x 1) is assumed to be initialized to zeros before the loop
r=-(t(V) %*% b);
p=-r;
norm_r2=sum(r*r);
max_iteration=20;
i=0;
while(i<max_iteration){
  q=((t(V) %*% (V %*% p)) + lambda*p);
  alpha=norm_r2/(t(p)%*%q);
  w=w+alpha*p;
  old_norm_r2=norm_r2;
  r=r+alpha*q;
  norm_r2=sum(r*r);
  beta=norm_r2/old_norm_r2;
  p=-r+beta*p;
  i=i+1;}
writeMM(w, "out/w");

Inputs: Sparse Linear Regression uses V: d x 100000, sparsity=0.001; PageRank uses G: n x n, sparsity=0.001

Page 40: The Power of Declarative Analytics

What design choices did we make for SystemML?

Data Model
  SystemT (Information Extraction): Document-at-a-time model; data types: Span, Tuple, Relation
  SystemML (Machine Learning): Data types: Matrix, Vector, Scalar
Operations
  SystemT: Feature extraction primitives; text-specific primitives; set-level primitives
  SystemML: Procedural constructs (e.g., while, for); Linear Algebra operations; External Functions; Machine Learning specific constructs (e.g., Cross Validation, Ensemble Learning)
Language Syntax
  SystemT: SQL-like syntax
  SystemML: R-like syntax
Platform
  SystemT: Embeddable runtime deployed in a wide range of execution environments
  SystemML: MapReduce Runtime

Page 41: The Power of Declarative Analytics


Summary

Page 42: The Power of Declarative Analytics


Lessons Learned

– SystemT
• Ships with eight IBM products
• To date, we have not encountered a request that is not expressible in AQL

– SystemML
• Ships with the IBM BigInsights August beta this year
• Declarative is the goal, but procedural constructs are needed to express machine learning algorithms
• Users naturally gravitate to procedural constructs. Limiting such constructs to cases where they are required to specify “what needs to be done” may take a lot of training

– SystemT
• The choice of SQL-like syntax and Eclipse-based tooling quickly enabled hundreds of users with varied backgrounds
• But traditional NLP trainees prompted us to provide a layer on top of AQL with grammar-like syntax
• Business users demand even simpler and more usable tooling

– SystemML
• Early days, but there are multiple users inside IBM, and almost all are previous R / Matlab users
• Familiar R syntax gets ML users up and running almost immediately

– SystemT
• Document-at-a-time model and all in-memory optimizations
• Demonstrates that an order-of-magnitude throughput improvement can be obtained
• Hardware acceleration further speeds up execution

– SystemML
• Computation on a large-scale distributed platform
• Initial experiences reinforce the argument “Query optimizers can beat all but the best programmers”
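One place where an optimizer can beat a hand-written program is the order in which a chain of matrix products is evaluated. The sketch below counts scalar multiplications for the product of e (n x 1), ut (1 x n) and p (n x 1) from the PageRank script, under the assumption of dense operands and the textbook cost model of r*k*c multiplications for an (r x k) by (k x c) product. It is an illustration of the general idea, not a description of how SystemML costs its plans.

```python
# Compare two evaluation orders for the chain e %*% ut %*% p
# using the dense cost model: (r x k) times (k x c) costs r*k*c multiplications.

def matmul_cost(a_shape, b_shape):
    """Return (multiplication count, result shape) for a dense product."""
    rows, inner = a_shape
    inner_b, cols = b_shape
    assert inner == inner_b, "inner dimensions must agree"
    return rows * inner * cols, (rows, cols)

n = 10**6
e, ut, p = (n, 1), (1, n), (n, 1)

# Left to right: (e %*% ut) %*% p builds an n x n intermediate.
c1, e_ut = matmul_cost(e, ut)
c2, _ = matmul_cost(e_ut, p)
left_first = c1 + c2

# Right to left: e %*% (ut %*% p) builds a 1 x 1 intermediate.
c3, ut_p = matmul_cost(ut, p)
c4, _ = matmul_cost(e, ut_p)
right_first = c3 + c4

print(f"(e ut) p: {left_first:.1e} multiplications")
print(f"e (ut p): {right_first:.1e} multiplications")
```

A programmer writing the expression left to right would pay for the n x n intermediate, whereas a system that reasons about the whole expression is free to pick the cheaper order.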

Page 43: The Power of Declarative Analytics


Tooling Research for the Development Life-Cycle

Development: Develop, Test, Analyze

Maintenance: Deploy, Refine, Test

Task Analysis [ACL’11,12,13, CHI’13]

• Concordance Viewer

• Active labeling

• Labeling tool

• Extraction plan

• Track provenance [VLDB’10]

• Contextual clue discovery [CIKM’11]

• Regex learning [EMNLP’08]

• Suggest rule changes [VLDB’10]

• Rule induction [EMNLP’12]

• Dictionary refinement [SIGMOD’13]

• Rule learning

• NE Interface [EMNLP’10]

• Tagger UI [SIGMOD’07]

Page 44: The Power of Declarative Analytics


Eclipse Tools Overview

Ease of Programming

Performance Tuning

Automatic Discovery

AQL Editor

Explain

Pattern Discovery

Result Viewer

Regex Learner

AQL Editor: syntax highlighting, auto-complete, hyperlink navigation

Result Viewer: visualize/compare/evaluate

Explain: show how each result was generated

Workflow UI: end-to-end development wizard

Regex Generator: generate regular expressions from examples

Pattern Discovery: identify patterns in the data

Profiler: identify performance bottlenecks to be hand tuned

Page 45: The Power of Declarative Analytics


Web Tools Overview

Ease of Programming

Ease of Sharing

Canvas:

• Visual construction of extractors

• Customization of existing extractors

Result Viewer: visualize/compare/evaluate

Concept catalog: share concepts

Project: share extractor development

Even for non-programmers

Page 46: The Power of Declarative Analytics


Don Chamberlin

We set out to help non-programmers interact with databases to open up access to data to a whole new class of people who could do things that were never possible before. The problem that we didn't think we were working on at all was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems -those are the ways in which SQL ultimately turned out to be very successful,

SQL Reunion, 1995

Don observed that the success of SQL was due to the language serving as an interchange medium between systems. In contrast, declarative systems for analytics may indeed succeed at the purpose SQL was originally intended for: opening up access to analytics to a whole new class of people.

Don has the last word …

Maybe ..

Page 47: The Power of Declarative Analytics


Thank You!
