Transcript
Page 1: Higher Order Learning

Higher Order Learning

William M. Pottenger, Ph.D.
Rutgers University

ARO Workshop

Page 2: Higher Order Learning


Outline

• Introduction
• Overview
  – IID Assumption in Machine Learning
  – Statistical Relational Learning (SRL)
  – Higher-order Co-occurrence Relations
• Approach
  – Supervised Higher Order Learning
  – Unsupervised Higher Order Learning
• Conclusion

Page 3: Higher Order Learning


IID Assumption in Machine Learning

• Data mining tasks such as association rule mining, cluster analysis, and classification aim to find patterns or form a model from a collection of instances.

• Traditionally, instances are assumed to be independent and identically distributed (IID).
  – In classification, a model is applied to a single instance, and the decision is based on the feature vector of this instance in a “context-free” manner, independent of the other instances in the test set.

• This context-free approach does not exploit the available information about relationships between instances in the dataset (Angelova & Weikum, 2006)

Page 4: Higher Order Learning


Statistical Relational Learning (SRL)

• Underlying assumption: linked instances are often correlated
• SRL operates on relational data with explicit links between instances
  – Explicitly leverages correlations between related instances
• Collective inference / classification
  – Simultaneously label all test instances together
  – Exploit the correlations between class labels of related instances
  – Learn from one network (a set of labeled training instances with links)
  – Apply the model to a separate network (a set of unlabeled test instances with links)
• Iterative algorithms (sketched below)
  – First assign initial class labels (content-only traditional classifier)
  – Adjust class labels using the class labels of linked instances
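A minimal sketch of this iterative scheme (all names are hypothetical; `content_clf` stands for any trained content-only classifier and `graph` for the explicit links):

```python
from collections import Counter

def collective_classify(content_clf, features, graph, n_iters=10):
    # Bootstrap: label every test instance with the content-only classifier.
    labels = {i: content_clf.predict([x])[0] for i, x in features.items()}
    # Iteratively adjust each label using the labels of linked instances.
    for _ in range(n_iters):
        for i in features:
            neighbor_labels = [labels[j] for j in graph.get(i, [])]
            if neighbor_labels:
                # Simple relational rule: adopt the neighborhood majority.
                labels[i] = Counter(neighbor_labels).most_common(1)[0][0]
    return labels
```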

Page 5: Higher Order Learning


Statistical Relational Learning (SRL)

• Several tasks
  – Collective classification / link-based classification
  – Link prediction
  – Link-based clustering
  – Social network modeling
  – Object identification
  – Bibliometrics
  – …
• Ref: P. Domingos and M. Richardson (2004). Markov Logic: A Unifying Framework for Statistical Relational Learning. Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (pp. 49-54). Banff, Canada: IMLS.

Page 6: Higher Order Learning


Some Related Work in SRL

• Relational Markov Networks (Taskar et al., 2002)
  – Extend Markov networks for relational data
  – Discriminatively train an undirected graphical model: for every link between two pages, there is an edge between the labels of these pages
  – Significant improvement over the flat model (logistic regression)
• Link-based Classification (Lu & Getoor, 2003)
  – Structured logistic regression
  – Iterative classification algorithm
  – Outperforms content-only classifier on WebKB, Cora, CiteSeer datasets
• Relational Dependency Networks (Neville & Jensen, 2004)
  – Extend Dependency Networks (DNs) for relational data
  – Experiments on IMDb, Cora, WebKB, Gene datasets
  – Results: RDN model is superior to IID classifier
• Graph-based Text Classification (Angelova & Weikum, 2006)
  – Uses a graph in which nodes are instances and edges are the relationships between instances in the dataset
  – Increase in performance on DBLP, IMDb, Wikipedia datasets
  – Interesting observation: gains are most prominent for small training sets

Page 7: Higher Order Learning


Reasoning by Abductive Inference

• Need for reasoning from evidence, even in the face of information that may be incomplete, inexact, inaccurate, or from diverse sources

• Evidence is provided by sets of diverse, distributed, and noisy sensors and information.

• Build a quantitative theoretical framework for reasoning by abduction in the face of real-world uncertainties.

• Reasoning by leveraging higher order relations…

Page 8: Higher Order Learning


Gathering Evidence

[Diagram: co-occurrence evidence gathered from separate sources, linking “migraine” and “magnesium” through the intermediate terms stress, CCB, PA, and SCD]

Slide reused with permission of Marti Hearst @ UCB

Page 9: Higher Order Learning


A Higher Order Co-Occurrence Relation!

[Diagram: migraine and magnesium connected via the intermediate terms stress, CCB, PA, and SCD — a higher-order co-occurrence relation]

Slide reused with permission of Marti Hearst @ UCB

No single author knew/wrote about this connection… this distinguishes Text Mining from Information Retrieval.

Page 10: Higher Order Learning


Uses of Higher-order Co-occurrence Relations

Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining.

• Literature Based Discovery (LBD) (Swanson, 1988)
  – Migraine ↔ (stress, calcium channel blockers) ↔ Magnesium
• Improve the runtime performance of LSI (Zhang et al., 2000)
  – Explicitly use 2nd-order co-occurrence to reduce the term-by-document matrix M(T×D)
• Word sense disambiguation (Schütze, 1998)
  – Similarity in word space is based on 2nd-order co-occurrence (illustrated below)
• Identifying synonyms in a given context (Edmonds, 1997)
  – Precision of system using 3rd order > 2nd order > 1st order
• Stemming algorithm (Xu & Croft, 1998)
  – Implicitly uses higher orders of co-occurrence
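For illustration, second-order co-occurrence can be read off a Boolean matrix product; a sketch (the toy matrix is mine, not code from the cited systems):

```python
import numpy as np

# Rows = terms, columns = documents; 1 means the term occurs in the document.
td = np.array([[1, 0],   # "migraine"  occurs in doc 0 only
               [1, 1],   # "stress"    occurs in docs 0 and 1
               [0, 1]])  # "magnesium" occurs in doc 1 only

first = ((td @ td.T) > 0).astype(int)   # terms sharing a document
second = (first @ first) > 0            # terms sharing a first-order neighbor
print(bool(first[0, 2]))   # False: migraine and magnesium never co-occur
print(bool(second[0, 2]))  # True: second-order relation through "stress"
```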

Page 11: Higher Order Learning


Is there a theoretical basis for the use of higher order co-occurrence relations?

• Research agenda: study machine learning algorithms in search of a theoretical foundation for the use of higher order relations
• First algorithm: Latent Semantic Indexing (LSI)
  – Widely used technique in text mining and IR based on the Singular Value Decomposition (SVD) matrix factoring algorithm
  – Semantically similar terms lie closer in the LSI vector space even though they don’t co-occur; LSI reveals hidden or latent relationships
  – Research question: Does LSI leverage higher order term co-occurrence?

Page 12: Higher Order Learning


Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?

• Yes! The answer is in the following theorem we proved: if the ijth element of the truncated term-by-term matrix Y is non-zero, then there exists a co-occurrence path of order ≥ 1 between terms i and j (a toy example follows).
  – Kontostathis, A. and Pottenger, W. M. (2006). A Framework for Understanding LSI Performance. Information Processing & Management.
• We have both proven mathematically and demonstrated empirically that LSI is based on the use of higher order co-occurrence relations.
• Next step? Extend the theoretical foundation by studying characteristics of higher-order relations in other machine learning datasets/algorithms such as association rule mining, supervised learning, etc.
  – Start by analyzing higher-order relations in labeled training data used in supervised machine learning
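A toy illustration of the theorem’s object (the matrix and truncation rank are my own choices): the truncated term-by-term matrix Y from the rank-k SVD can be nonzero for term pairs that never co-occur directly.

```python
import numpy as np

# Term-by-document matrix: terms 0 and 2 never share a document.
A = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                            # truncation rank
Y = U[:, :k] @ np.diag(s[:k] ** 2) @ U[:, :k].T  # truncated term-term matrix

# (A @ A.T)[0, 2] is zero (no direct co-occurrence), yet Y[0, 2] is not:
# per the theorem, a higher-order co-occurrence path links terms 0 and 2.
print((A @ A.T)[0, 2], round(Y[0, 2], 3))
```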

Page 13: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Goal: discover patterns in higher-order paths useful in separating the classes
• Co-occurrence relations in a record or instance set can be represented as an undirected graph G = (V, E)
  – V: a finite set of vertices (e.g., entities in a record)
  – E: the set of edges representing co-occurrence relations (edges are labeled with the record(s) in which the entities co-occur)
• Path definition from graph theory: two vertices x_i and x_k are linked by a path P whose vertices are all distinct; the number of edges in P is its length.
• Higher-order path: not only the vertices (entities) but also the edges (records) must be distinct (see the sketch below).

[Figure: entities e1–e5 linked by record-labeled edges r1–r6 — an example of a fourth-order path between e1 and e5, as well as several shorter paths]
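A small sketch of path enumeration under this definition (the graph encoding of the figure is my assumption):

```python
# Enumerate higher-order paths: both entities (vertices) and the records
# labeling the edges must be distinct along a path.
def higher_order_paths(edges, start, end):
    paths = []
    def dfs(node, seen_nodes, seen_records, path):
        if node == end:
            paths.append(path)
            return
        for nxt, rec in edges.get(node, []):
            if nxt not in seen_nodes and rec not in seen_records:
                dfs(nxt, seen_nodes | {nxt}, seen_records | {rec},
                    path + [(rec, nxt)])
    dfs(start, {start}, set(), [])
    return paths

# Hypothetical encoding of the figure: e1..e5 linked by records r1..r6.
edges = {"e1": [("e2", "r1")], "e2": [("e3", "r2"), ("e3", "r5")],
         "e3": [("e4", "r3"), ("e4", "r6")], "e4": [("e5", "r4")]}
for p in higher_order_paths(edges, "e1", "e5"):
    print(p)  # each path uses 4 distinct records: a fourth-order path
```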

Page 14: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Path group: a path (length ≥ 2) is extracted per the definition of a path from graph theory. In the example, a 2nd-order path group comprises two sets of records: S1 = {1, 2, 5} and S2 = {1, 2, 3, 4}. A path group may be composed of several higher-order paths.
• A bipartite graph G = (V1 ∪ V2, E) is formed, where V1 is the collection of record sets and V2 is the set of individual records. Enumerating all maximum matchings in this graph yields all higher-order paths in the path group. Another approach is to discover the systems of distinct representatives (SDRs) of these sets (sketched below).

[Figures: an example co-occurrence graph over entities e1–e5 and records R1–R5; an example 2nd-order path group for e1–e2–e3 with record sets S1 = {R1, R2, R5} and S2 = {R1, R2, R3, R4}; a valid 2nd-order path e1–R1–e2–R3–e3]
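A brief sketch of the SDR view of this example (the brute-force enumeration is mine; enumerating maximum matchings scales better):

```python
from itertools import product

# Record sets of the 2nd-order path group from the slide.
S1 = {1, 2, 5}      # records containing the first co-occurrence (e1, e2)
S2 = {1, 2, 3, 4}   # records containing the second co-occurrence (e2, e3)

# A system of distinct representatives picks one record per set, all
# distinct; each SDR is one valid higher-order path in the path group.
sdrs = [(a, b) for a, b in product(S1, S2) if a != b]
print(sdrs)       # includes (1, 3), i.e. the valid path e1-R1-e2-R3-e3
print(len(sdrs))  # 10 valid 2nd-order paths; (1, 1) and (2, 2) are excluded
```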

Page 15: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Approach: discover frequent itemsets in higher-order paths
  – For labeled datasets, divide instances by class and enumerate k-itemsets (initially for k in {3, 4})
  – Results in a distribution of k-itemset frequencies for a given class
  – Compare distributions using a simple statistical measure such as the t-test to determine independence (see the sketch below)
  – If two distributions are statistically significantly different, we conclude that the higher-order path patterns (i.e., itemset frequencies) distinguish the classes
• Labeled training data analyzed
  – Mushroom dataset: performs well on decision trees
  – Border Gateway Protocol updates: relevant to cybersecurity
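A minimal sketch of the comparison step, assuming hypothetical per-class 3-itemset frequency vectors:

```python
from scipy import stats

# Hypothetical 3-itemset frequencies enumerated from higher-order paths,
# one distribution per class.
freqs_class_e = [12, 7, 30, 4, 19, 8, 25]
freqs_class_p = [3, 41, 2, 38, 1, 29, 5]

# Two-sample t-test: a small two-tail p-value suggests the higher-order
# path patterns (itemset frequencies) distinguish the classes.
t_stat, p_two_tail = stats.ttest_ind(freqs_class_e, freqs_class_p)
print(t_stat, p_two_tail)
```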

Page 16: Higher Order Learning


Preliminary Results – Supervised ML dataset

• For each fold, we compared the 3-itemset frequencies of the E set vs. the P set
• Interesting result: six of the 10 folds had a confidence of 95% or greater that the E and P instances are statistically significantly different
  – The other folds fell between 80% and 95% (see the table below)

Fold | t Stat | P(T<=t) one-tail | t Critical one-tail | P(T<=t) two-tail | t Critical two-tail
0    | -2.684 | 0.0037           | 1.6471              | 0.0074           | 1.9634
1    | -1.357 | 0.0875           | 1.6467              | 0.1751           | 1.9629
2    | -1.554 | 0.0603           | 1.6468              | 0.1205           | 1.9629
3    | -2.924 | 0.0018           | 1.6472              | 0.0036           | 1.9636
4    | -1.908 | 0.0284           | 1.6469              | 0.0568           | 1.9631
5    | -2.047 | 0.0205           | 1.6469              | 0.0410           | 1.9631
6    | -1.455 | 0.0730           | 1.6467              | 0.1460           | 1.9629
7    | -2.023 | 0.0217           | 1.6469              | 0.0434           | 1.9631
8    | -2.795 | 0.0027           | 1.6471              | 0.0053           | 1.9635
9    | -2.710 | 0.0034           | 1.6470              | 0.0069           | 1.9633

Ganiz, M., Pottenger, W. M. and Yang, X. (2006). Link Analysis of Higher-Order Paths in Supervised Learning Datasets. In Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, 2006 SIAM Conference on Data Mining, Bethesda, MD, April 2006.

Page 17: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Detection of interdomain routing anomalies based on higher-order path analysis
  – The Border Gateway Protocol (BGP) is the de facto interdomain routing protocol for the Internet.
  – Anomalous BGP events (misconfigurations, attacks, and large-scale power failures) often affect the global routing infrastructure.
    • Slammer worm attack (January 25, 2003)
    • Witty worm attack (March 19, 2004)
    • 2003 East Coast Blackout (i.e., power failure)
  – Goal: detect and categorize such events

Page 18: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Detection of interdomain routing anomalies based on higher-order path analysis
  – The data are divided into three-second bins
  – Each bin is a single instance in our training data (a featurization sketch follows the table)

ID | Attribute       | Definition
1  | Announce        | # of BGP announcements
2  | Withdrawal      | # of BGP withdrawals
3  | Update          | # of BGP updates (= Announce + Withdrawal)
4  | Announce Prefix | # of announced prefixes
5  | Withdraw Prefix | # of withdrawn prefixes
6  | Updated Prefix  | # of updated prefixes (= Announce Prefix + Withdraw Prefix)
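A sketch of this featurization (the update record fields are assumptions; a real BGP feed needs a proper MRT parser):

```python
from collections import defaultdict

def featurize(updates, bin_seconds=3):
    # Each update is assumed to look like:
    # {"time": 1043500000.2, "type": "A" or "W", "prefixes": ["10.0.0.0/8"]}
    bins = defaultdict(lambda: {"announce": 0, "withdraw": 0,
                                "ann_pfx": 0, "wd_pfx": 0})
    for u in updates:
        b = bins[int(u["time"] // bin_seconds)]   # 3-second bin index
        if u["type"] == "A":
            b["announce"] += 1
            b["ann_pfx"] += len(u["prefixes"])
        else:
            b["withdraw"] += 1
            b["wd_pfx"] += len(u["prefixes"])
    # Add the two composite attributes; each bin is one training instance.
    return [{**b, "update": b["announce"] + b["withdraw"],
             "upd_pfx": b["ann_pfx"] + b["wd_pfx"]} for b in bins.values()]
```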

Page 19: Higher Order Learning


Preliminary Results – BGP dataset

• Border Gateway Protocol (BGP) routing data
  – BGP messages generated during interdomain routing
  – Relevant to cybersecurity
• Detect abnormal BGP events
  – Internet worm attacks (Slammer, Witty, …), power failures, etc.
  – Data from a period of time surrounding/including worm propagation
  – Instance → three-second sample of BGP traffic
  – Six numeric attributes (Li et al., 2005)
• Previously, a decision tree was applied successfully for two classes: worm vs. normal (Li et al., 2005)
  – Cannot distinguish different worms!

Page 20: Higher Order Learning


Preliminary Results – BGP dataset

[Figure: two-tail p-values across 37 sliding windows for the Slammer, Witty, and Blackout events, each plotted against the 5% significance level]

Event 1  | Event 2  | t-test result
Slammer  | Witty    | 0.00023
Blackout | Witty    | 0.00016
Slammer  | Blackout | 0.018

• 240 instances to characterize a particular abnormal event
• Sliding window approach for detection (sketched below)
  – Window size: 120 instances (360 seconds)
  – Sliding by 10 instances (sampling every 30 seconds)

Ganiz, M., Pottenger, W.M., Kanitkar, S., Chuah, M.C. (2006b). Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis. Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06), December 2006, Hong Kong, China
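A sketch of the window arithmetic above (the frequency inputs and the t-test scoring are placeholders for the published method):

```python
from scipy import stats

def sliding_window_pvalues(event_freqs, reference_freqs,
                           window=120, step=10):
    # 120 instances = 360 seconds per window; step of 10 instances
    # = one comparison every 30 seconds, as on the slide.
    pvals = []
    for start in range(0, len(event_freqs) - window + 1, step):
        _, p = stats.ttest_ind(event_freqs[start:start + window],
                               reference_freqs)
        pvals.append(p)
    # Windows with p < 0.05 flag a statistically distinct (abnormal) event.
    return pvals
```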

Page 21: Higher Order Learning


Preliminary Results – Naïve Bayes on Higher-order Paths

• Cora (McCallum et al., 2000)
  – Scientific paper dataset
  – Several classes: case based, neural networks, etc.
  – 2708 documents, 1433 terms, 5429 links
  – Terms are ordered most sparse first
• Instead of links, we used higher order paths in a Naïve Bayes framework
• E.g., when 2nd-order paths are used, F-beta (beta = 1) is higher starting from dictionary size 400

[Figure: Cora dataset, macro-averaged F1 vs. dictionary size (200 to 1433) for the Fb1 and Fb2 series]
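One plausible reading of “higher order paths in a Naïve Bayes framework” (my interpretation, not necessarily the authors’ formulation): inject second-order co-occurrence mass into the term counts before fitting a standard multinomial Naïve Bayes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-term count matrix (3 docs x 4 terms) and class labels.
X_train = np.array([[2, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 2]])
y_train = np.array([0, 0, 1])

# First-order term-term co-occurrence learned from the training corpus.
co = ((X_train.T @ X_train) > 0).astype(int)
np.fill_diagonal(co, 0)

def add_second_order(X):
    # Terms reachable from a document's terms through one co-occurrence
    # edge contribute virtual counts: a rough stand-in for 2nd-order paths.
    return X + X @ co

clf = MultinomialNB().fit(add_second_order(X_train), y_train)
print(clf.predict(add_second_order(np.array([[1, 0, 0, 1]]))))
```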

Page 22: Higher Order Learning


What role do higher-order relations play in unsupervised machine learning?

• Next step? Consider unsupervised learning…
  – Association Rule Mining (ARM)
    • ARM is one of the most widely used algorithms in data mining
  – Extend ARM to higher order… Higher Order Apriori
• Experiments confirm the value of Higher Order Apriori on real-world e-marketplace data

Page 23: Higher Order Learning


Higher Order Apriori: Approach

• First we extend the itemset definition to incorporate k-itemsets up to nth order.
  – Definition 1: Items a and b are nth-order associated if a and b can be associated across n distinct records (sketched below).
  – Definition 2: An nth-order k-itemset is a k-itemset for which each pair of its items is nth-order associated.
  – Definition 3: Two records are nth-order linked if they can be linked through n-2 distinct records.
  – Definition 4: An nth-order itemset i1 i2 … in is supported by an nth-order recordset r1 r2 … rn if no two items come from the same record.

$i_1 \sim r_1 \sim i_2 \sim r_2 \sim \cdots \sim r_{n-1} \sim i_n$

$a \sim r_1 \sim i_1 \sim r_2 \sim i_2 \sim \cdots \sim i_{n-1} \sim r_n \sim b$
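A small sketch of Definitions 1 and 3 (the record encoding is mine): breadth-first search over records that share an item, counting how many records the association spans.

```python
from collections import deque

def association_order(records, a, b):
    # records: list of item sets. Returns the smallest n such that a and b
    # are nth-order associated (a chain of n distinct records, consecutive
    # records sharing an item), or None if no such chain exists.
    starts = [i for i, r in enumerate(records) if a in r]
    queue = deque((i, 1) for i in starts)
    seen = set(starts)
    while queue:
        i, n = queue.popleft()
        if b in records[i]:
            return n
        for j, r in enumerate(records):
            if j not in seen and records[i] & r:
                seen.add(j)
                queue.append((j, n + 1))
    return None

records = [{"a", "x"}, {"x", "y"}, {"y", "b"}]
print(association_order(records, "a", "b"))  # 3: a third-order association
```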

Page 24: Higher Order Learning


Higher Order Apriori: Approach

• Given j instances of an nth-order k-recordset rs, its size is defined as:

  $size(rs_{n\_k}) = \sum_{t=1}^{j} \sum_{v=1}^{k(k-1)/2} |I_v|$

• Since the same k-itemset can be generated at different orders, the global support for a given k-itemset must include the local support at each order u, giving:

  $\sup(is_k) = \sum_{u=1}^{max\_order} \log_{10}\big(size(rs_{u\_k}) + 1\big)$
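Under this reading of the formulas (reconstructed above from a garbled source), a sketch of the global support computation with hypothetical per-order sizes:

```python
import math

def global_support(sizes_by_order):
    # sizes_by_order[u]: size of the order-u recordsets generating the
    # k-itemset. The log damping keeps no single order from dominating.
    return sum(math.log10(s + 1) for s in sizes_by_order.values())

# Hypothetical k-itemset generated at orders 1, 2 and 4.
print(global_support({1: 9, 2: 99, 4: 0}))  # 1.0 + 2.0 + 0.0 = 3.0
```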

Page 25: Higher Order Learning


Higher Order Apriori: Approach

• Higher Order Apriori is structured in a level-wise, order-first manner (see the sketch below).
  – Level-wise: the size of the k-itemsets increases in each iteration (as is the case for Apriori)
  – Order-first: at each level, itemsets are generated across all orders
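A structural sketch of this control flow (the candidate-generation and support routines are placeholders, not the published algorithm):

```python
def higher_order_apriori(records, max_k, max_order, min_support,
                         gen_candidates, support):
    # Level-wise: the itemset size k grows each iteration, as in Apriori.
    # Order-first: at each level, candidates are generated across all
    # orders before pruning, so every order competes at the same level.
    frequent, prev_level = {}, None
    for k in range(2, max_k + 1):
        candidates = set()
        for order in range(1, max_order + 1):        # order-first sweep
            candidates |= gen_candidates(records, k, order, prev_level)
        prev_level = {c for c in candidates
                      if support(records, c) >= min_support}
        if not prev_level:
            break
        frequent[k] = prev_level
    return frequent
```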

Page 26: Higher Order Learning


Higher Order Apriori: Results

• Our algorithm was tested on real-world e-commerce data from the KDD Cup 2000 competition. There are 530 transactions involving 244 products in the dataset.
• We compared the itemsets generated by Higher Order Apriori (limited to 6th order) with two other algorithms:
  – Apriori (1st order)
  – Indirect (our algorithm limited to 2nd order)
• We conducted experiments on multiple systems, including at the National Center for Supercomputing Applications (NCSA)

Page 27: Higher Order Learning


Higher Order Apriori: Results

• Higher Order Apriori mines significantly more final itemsets than Apriori and Indirect

[Figure: log(number of itemsets) vs. number of records (50, 75, 100, 200, 530) for Apriori, Indirect, and HOApriori; k = 2 for HOApriori and Indirect]

• Next we show that high-support itemsets are discovered using smaller datasets than Apriori or Indirect require

Page 28: Higher Order Learning


Higher Order Apriori: Results

• {CU, DQ} is the top-ranked 2-itemset using Apriori on all 530 transactions.
  – Neither Apriori nor Indirect leverages the latent higher-order information in runs of 75, 100, and 200 random transactions
  – Higher Order Apriori discovered this itemset as top-ranked using only 75 transactions
  – In addition, the gap between the supports increases as the transaction sets get larger

[Figures: ranking of {CU, DQ} vs. number of records for Indirect and HOApriori; support of {CU, DQ} vs. number of records for Apriori, Indirect, and HOApriori]

Page 29: Higher Order Learning


Higher Order Apriori: Results

• Discovering novel itemsets

                      | Apriori                | Indirect                          | Higher Order Apriori
Itemsets Discovered   | {AY, X} {X, K} {K, Q}  | {AY, K} {X, Q} + Apriori itemsets | {AY, Q} + Indirect itemsets + Apriori itemsets
Itemsets Undiscovered | {AY, K} {X, Q} {AY, Q} | {AY, Q}                           | (none)

• AY- Girdle-at-the-top Classic Sheer Pantyhose

• Q- Men’s City Rib Socks - 3 Pack

• Shaver : Women’s Pantyhose relationship

Apriori              | (Donna Karan’s Extra Thin Pantyhose, Wet/Dry Shaver)
Indirect             | (Berkshire’s Ultra Nudes Pantyhose, Epilady Wet/Dry Shaver)
Higher Order Apriori | (Donna Karan’s Pantyhose, Epilady Wet/Dry Shaver)

• This relationship is also discovered by Apriori and Indirect, but Higher Order Apriori discovered a new nugget, which provides extra evidence for the relationship

Page 30: Higher Order Learning


Higher Order Apriori: Results

• Discovering novel relationships
  – Higher Order Apriori discovers itemsets that demonstrate novel relationships not discovered by lower-order methods.
  – For example, the following are reasonable relationships. While Apriori and Indirect failed to discover itemsets representing them in the SIGKDD dataset, they might discover such links given a larger dataset.

Shaver : Lotion/Cream
  Apriori:              No
  Indirect:             No
  Higher Order Apriori: (Pedicure Care Kit, Toning Lotion), (Wet/Dry Shaver, Herb Lotion), (Pedicure Care Kit, Leg Cream)

Foot cream : Women’s socks
  Apriori:              No
  Indirect:             No
  Higher Order Apriori: (Foot Cream, Women’s Ultra Sheer Knee High), (Foot Cream, Women’s Cotton Dog Sock)

Page 31: Higher Order Learning


Conclusions

• Many traditional machine learning algorithms assume instances are independent and identically distributed (IID)
  – Apply the model to a single instance (decision is based on the feature vector) in a “context-free” manner
  – Independent of the other instances in the test set
• Statistical Relational Learning (SRL)
  – Classifies a set of instances simultaneously (collective classification)
  – Utilizes relations (links) between instances in the dataset
  – Usually considers immediate neighbors
  – Violates the “independence” assumption
• Our approach utilizes the latent information based on higher-order paths
  – Utilizes higher-order paths of order greater than or equal to two
  – Higher-order paths are implicit, based on co-occurrences of entities
  – We do not use the explicit links in the dataset!
  – Captures “latent semantics” (as in Latent Semantic Indexing)

Page 32: Higher Order Learning


Thanks

Q&A