Transcript
Page 1: Higher Order Learning

Higher Order Learning

William M. Pottenger, Ph.D.
Rutgers University

ARO Workshop

Page 2: Higher Order Learning


Outline

• Introduction
• Overview
  – IID Assumption in Machine Learning
  – Statistical Relational Learning (SRL)
  – Higher-order Co-occurrence Relations
• Approach
  – Supervised Higher Order Learning
  – Unsupervised Higher Order Learning
• Conclusion

Page 3: Higher Order Learning


IID Assumption in Machine Learning

• Data mining tasks such as association rule mining, cluster analysis, and classification aim to find patterns or form a model from a collection of instances.

• Traditionally, instances are assumed to be independent and identically distributed (IID).
  – In classification, a model is applied to a single instance, and the decision is based on the feature vector of this instance in a “context-free” manner, independent of the other instances in the test set.

• This context-free approach does not exploit the available information about relationships between instances in the dataset (Angelova & Weikum, 2006)

Page 4: Higher Order Learning


Statistical Relational Learning (SRL)

• Underlying assumption: linked instances are often correlated
• SRL operates on relational data with explicit links between instances
  – Explicitly leverages correlations between related instances
• Collective inference / classification
  – Simultaneously label all test instances together
  – Exploit the correlations between class labels of related instances
  – Learn from one network (a set of labeled training instances with links)
  – Apply the model to a separate network (a set of unlabeled test instances with links)
• Iterative algorithms (sketched below)
  – First assign initial class labels (content-only traditional classifier)
  – Adjust class labels using the class labels of linked instances
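A minimal sketch of this iterative scheme (all names are hypothetical; `content_clf` stands for any trained content-only classifier and `graph` for the explicit links):

```python
from collections import Counter

def collective_classify(content_clf, features, graph, n_iters=10):
    # Bootstrap: label every test instance with the content-only classifier.
    labels = {i: content_clf.predict([x])[0] for i, x in features.items()}
    # Iteratively adjust each label using the labels of linked instances.
    for _ in range(n_iters):
        for i in features:
            neighbor_labels = [labels[j] for j in graph.get(i, [])]
            if neighbor_labels:
                # Simple relational rule: adopt the neighborhood majority.
                labels[i] = Counter(neighbor_labels).most_common(1)[0][0]
    return labels
```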

Page 5: Higher Order Learning


Statistical Relational Learning (SRL)

• Several tasks
  – Collective classification / link-based classification
  – Link prediction
  – Link-based clustering
  – Social network modeling
  – Object identification
  – Bibliometrics
  – …
• Ref: P. Domingos and M. Richardson (2004). Markov Logic: A Unifying Framework for Statistical Relational Learning. Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (pp. 49-54). Banff, Canada: IMLS.

Page 6: Higher Order Learning


Some Related Work in SRL

• Relational Markov Networks (Taskar et al., 2002)
  – Extend Markov networks for relational data
  – Discriminatively train an undirected graphical model: for every link between two pages, there is an edge between the labels of these pages
  – Significant improvement over the flat model (logistic regression)
• Link-based Classification (Lu & Getoor, 2003)
  – Structured logistic regression
  – Iterative classification algorithm
  – Outperforms content-only classifier on WebKB, Cora, CiteSeer datasets
• Relational Dependency Networks (Neville & Jensen, 2004)
  – Extend Dependency Networks (DNs) for relational data
  – Experiments on IMDb, Cora, WebKB, Gene datasets
  – Results: RDN model is superior to IID classifier
• Graph-based Text Classification (Angelova & Weikum, 2006)
  – Uses a graph in which nodes are instances and edges are the relationships between instances in the dataset
  – Increase in performance on DBLP, IMDb, Wikipedia datasets
  – Interesting observation: gains are most prominent for small training sets

Page 7: Higher Order Learning


Reasoning by Abductive Inference

• Need for reasoning from evidence, even in the face of information that may be incomplete, inexact, inaccurate, or from diverse sources

• Evidence is provided by sets of diverse, distributed, and noisy sensors and information.

• Build a quantitative theoretical framework for reasoning by abduction in the face of real-world uncertainties.

• Reasoning by leveraging higher order relations…

Page 8: Higher Order Learning


Gathering Evidence

[Diagram: co-occurrence evidence gathered from separate sources, linking “migraine” and “magnesium” through the intermediate terms stress, CCB, PA, and SCD]

Slide reused with permission of Marti Hearst @ UCB

Page 9: Higher Order Learning


A Higher Order Co-Occurrence Relation!

[Diagram: migraine and magnesium connected via the intermediate terms stress, CCB, PA, and SCD — a higher-order co-occurrence relation]

Slide reused with permission of Marti Hearst @ UCB

No single author knew/wrote about this connection… this distinguishes Text Mining from Information Retrieval.

Page 10: Higher Order Learning


Uses of Higher-order Co-occurrence Relations

Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining.

• Literature Based Discovery (LBD) (Swanson, 1988)
  – Migraine ↔ (stress, calcium channel blockers) ↔ Magnesium
• Improve the runtime performance of LSI (Zhang et al., 2000)
  – Explicitly use 2nd-order co-occurrence to reduce the term-by-document matrix M(T×D)
• Word sense disambiguation (Schütze, 1998)
  – Similarity in word space is based on 2nd-order co-occurrence (illustrated below)
• Identifying synonyms in a given context (Edmonds, 1997)
  – Precision of system using 3rd order > 2nd order > 1st order
• Stemming algorithm (Xu & Croft, 1998)
  – Implicitly uses higher orders of co-occurrence
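For illustration, second-order co-occurrence can be read off a Boolean matrix product; a sketch (the toy matrix is mine, not code from the cited systems):

```python
import numpy as np

# Rows = terms, columns = documents; 1 means the term occurs in the document.
td = np.array([[1, 0],   # "migraine"  occurs in doc 0 only
               [1, 1],   # "stress"    occurs in docs 0 and 1
               [0, 1]])  # "magnesium" occurs in doc 1 only

first = ((td @ td.T) > 0).astype(int)   # terms sharing a document
second = (first @ first) > 0            # terms sharing a first-order neighbor
print(bool(first[0, 2]))   # False: migraine and magnesium never co-occur
print(bool(second[0, 2]))  # True: second-order relation through "stress"
```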

Page 11: Higher Order Learning


Is there a theoretical basis for the use of higher order co-occurrence relations?

• Research agenda: study machine learning algorithms in search of a theoretical foundation for the use of higher order relations
• First algorithm: Latent Semantic Indexing (LSI)
  – Widely used technique in text mining and IR based on the Singular Value Decomposition (SVD) matrix factoring algorithm
  – Semantically similar terms lie closer in the LSI vector space even though they don’t co-occur; LSI reveals hidden or latent relationships
  – Research question: Does LSI leverage higher order term co-occurrence?

Page 12: Higher Order Learning


Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?

• Yes! The answer is in the following theorem we proved: if the ijth element of the truncated term-by-term matrix Y is non-zero, then there exists a co-occurrence path of order ≥ 1 between terms i and j (a toy example follows).
  – Kontostathis, A. and Pottenger, W. M. (2006). A Framework for Understanding LSI Performance. Information Processing & Management.
• We have both proven mathematically and demonstrated empirically that LSI is based on the use of higher order co-occurrence relations.
• Next step? Extend the theoretical foundation by studying characteristics of higher-order relations in other machine learning datasets/algorithms such as association rule mining, supervised learning, etc.
  – Start by analyzing higher-order relations in labeled training data used in supervised machine learning
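A toy illustration of the theorem’s object (the matrix and truncation rank are my own choices): the truncated term-by-term matrix Y from the rank-k SVD can be nonzero for term pairs that never co-occur directly.

```python
import numpy as np

# Term-by-document matrix: terms 0 and 2 never share a document.
A = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                            # truncation rank
Y = U[:, :k] @ np.diag(s[:k] ** 2) @ U[:, :k].T  # truncated term-term matrix

# (A @ A.T)[0, 2] is zero (no direct co-occurrence), yet Y[0, 2] is not:
# per the theorem, a higher-order co-occurrence path links terms 0 and 2.
print((A @ A.T)[0, 2], round(Y[0, 2], 3))
```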

Page 13: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Goal: discover patterns in higher-order paths useful in separating the classes
• Co-occurrence relations in a record or instance set can be represented as an undirected graph G = (V, E)
  – V: a finite set of vertices (e.g., entities in a record)
  – E: the set of edges representing co-occurrence relations (edges are labeled with the record(s) in which the entities co-occur)
• Path definition from graph theory: two vertices x_i and x_k are linked by a path P whose vertices are all distinct; the number of edges in P is its length.
• Higher-order path: not only the vertices (entities) but also the edges (records) must be distinct (see the sketch below).

[Figure: entities e1–e5 linked by record-labeled edges r1–r6 — an example of a fourth-order path between e1 and e5, as well as several shorter paths]
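A small sketch of path enumeration under this definition (the graph encoding of the figure is my assumption):

```python
# Enumerate higher-order paths: both entities (vertices) and the records
# labeling the edges must be distinct along a path.
def higher_order_paths(edges, start, end):
    paths = []
    def dfs(node, seen_nodes, seen_records, path):
        if node == end:
            paths.append(path)
            return
        for nxt, rec in edges.get(node, []):
            if nxt not in seen_nodes and rec not in seen_records:
                dfs(nxt, seen_nodes | {nxt}, seen_records | {rec},
                    path + [(rec, nxt)])
    dfs(start, {start}, set(), [])
    return paths

# Hypothetical encoding of the figure: e1..e5 linked by records r1..r6.
edges = {"e1": [("e2", "r1")], "e2": [("e3", "r2"), ("e3", "r5")],
         "e3": [("e4", "r3"), ("e4", "r6")], "e4": [("e5", "r4")]}
for p in higher_order_paths(edges, "e1", "e5"):
    print(p)  # each path uses 4 distinct records: a fourth-order path
```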

Page 14: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Path group: a path (length ≥ 2) is extracted per the definition of a path from graph theory. In the example, a 2nd-order path group comprises two sets of records: S1 = {1, 2, 5} and S2 = {1, 2, 3, 4}. A path group may be composed of several higher-order paths.
• A bipartite graph G = (V1 ∪ V2, E) is formed, where V1 is the collection of record sets and V2 is the set of individual records. Enumerating all maximum matchings in this graph yields all higher-order paths in the path group. Another approach is to discover the systems of distinct representatives (SDRs) of these sets (sketched below).

[Figures: an example co-occurrence graph over entities e1–e5 and records R1–R5; an example 2nd-order path group for e1–e2–e3 with record sets S1 = {R1, R2, R5} and S2 = {R1, R2, R3, R4}; a valid 2nd-order path e1–R1–e2–R3–e3]
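A brief sketch of the SDR view of this example (the brute-force enumeration is mine; enumerating maximum matchings scales better):

```python
from itertools import product

# Record sets of the 2nd-order path group from the slide.
S1 = {1, 2, 5}      # records containing the first co-occurrence (e1, e2)
S2 = {1, 2, 3, 4}   # records containing the second co-occurrence (e2, e3)

# A system of distinct representatives picks one record per set, all
# distinct; each SDR is one valid higher-order path in the path group.
sdrs = [(a, b) for a, b in product(S1, S2) if a != b]
print(sdrs)       # includes (1, 3), i.e. the valid path e1-R1-e2-R3-e3
print(len(sdrs))  # 10 valid 2nd-order paths; (1, 1) and (2, 2) are excluded
```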

Page 15: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Approach: discover frequent itemsets in higher-order paths
  – For labeled datasets, divide instances by class and enumerate k-itemsets (initially for k in {3, 4})
  – Results in a distribution of k-itemset frequencies for a given class
  – Compare distributions using a simple statistical measure such as the t-test to determine independence (see the sketch below)
  – If two distributions are statistically significantly different, we conclude that the higher-order path patterns (i.e., itemset frequencies) distinguish the classes
• Labeled training data analyzed
  – Mushroom dataset: performs well on decision trees
  – Border Gateway Protocol updates: relevant to cybersecurity
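A minimal sketch of the comparison step, assuming hypothetical per-class 3-itemset frequency vectors:

```python
from scipy import stats

# Hypothetical 3-itemset frequencies enumerated from higher-order paths,
# one distribution per class.
freqs_class_e = [12, 7, 30, 4, 19, 8, 25]
freqs_class_p = [3, 41, 2, 38, 1, 29, 5]

# Two-sample t-test: a small two-tail p-value suggests the higher-order
# path patterns (itemset frequencies) distinguish the classes.
t_stat, p_two_tail = stats.ttest_ind(freqs_class_e, freqs_class_p)
print(t_stat, p_two_tail)
```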

Page 16: Higher Order Learning


Preliminary Results – Supervised ML dataset

• For each fold, we compared the 3-itemset frequencies of the E set vs. the P set
• Interesting result: six of the 10 folds had a confidence of 95% or greater that the E and P instances are statistically significantly different
  – The other folds fell between 80% and 95% (see the table below)

Fold | t Stat | P(T<=t) one-tail | t Critical one-tail | P(T<=t) two-tail | t Critical two-tail
0    | -2.684 | 0.0037           | 1.6471              | 0.0074           | 1.9634
1    | -1.357 | 0.0875           | 1.6467              | 0.1751           | 1.9629
2    | -1.554 | 0.0603           | 1.6468              | 0.1205           | 1.9629
3    | -2.924 | 0.0018           | 1.6472              | 0.0036           | 1.9636
4    | -1.908 | 0.0284           | 1.6469              | 0.0568           | 1.9631
5    | -2.047 | 0.0205           | 1.6469              | 0.0410           | 1.9631
6    | -1.455 | 0.0730           | 1.6467              | 0.1460           | 1.9629
7    | -2.023 | 0.0217           | 1.6469              | 0.0434           | 1.9631
8    | -2.795 | 0.0027           | 1.6471              | 0.0053           | 1.9635
9    | -2.710 | 0.0034           | 1.6470              | 0.0069           | 1.9633

Ganiz, M., Pottenger, W. M. and Yang, X. (2006). Link Analysis of Higher-Order Paths in Supervised Learning Datasets. In Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, 2006 SIAM Conference on Data Mining, Bethesda, MD, April 2006.

Page 17: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Detection of interdomain routing anomalies based on higher-order path analysis
  – The Border Gateway Protocol (BGP) is the de facto interdomain routing protocol for the Internet.
  – Anomalous BGP events (misconfigurations, attacks, and large-scale power failures) often affect the global routing infrastructure.
    • Slammer worm attack (January 25, 2003)
    • Witty worm attack (March 19, 2004)
    • 2003 East Coast Blackout (i.e., power failure)
  – Goal: detect and categorize such events

Page 18: Higher Order Learning


What role do higher-order relations play in supervised machine learning?

• Detection of interdomain routing anomalies based on higher-order path analysis
  – The data are divided into three-second bins
  – Each bin is a single instance in our training data (a featurization sketch follows the table)

ID | Attribute       | Definition
1  | Announce        | # of BGP announcements
2  | Withdrawal      | # of BGP withdrawals
3  | Update          | # of BGP updates (= Announce + Withdrawal)
4  | Announce Prefix | # of announced prefixes
5  | Withdraw Prefix | # of withdrawn prefixes
6  | Updated Prefix  | # of updated prefixes (= Announce Prefix + Withdraw Prefix)
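A sketch of this featurization (the update record fields are assumptions; a real BGP feed needs a proper MRT parser):

```python
from collections import defaultdict

def featurize(updates, bin_seconds=3):
    # Each update is assumed to look like:
    # {"time": 1043500000.2, "type": "A" or "W", "prefixes": ["10.0.0.0/8"]}
    bins = defaultdict(lambda: {"announce": 0, "withdraw": 0,
                                "ann_pfx": 0, "wd_pfx": 0})
    for u in updates:
        b = bins[int(u["time"] // bin_seconds)]   # 3-second bin index
        if u["type"] == "A":
            b["announce"] += 1
            b["ann_pfx"] += len(u["prefixes"])
        else:
            b["withdraw"] += 1
            b["wd_pfx"] += len(u["prefixes"])
    # Add the two composite attributes; each bin is one training instance.
    return [{**b, "update": b["announce"] + b["withdraw"],
             "upd_pfx": b["ann_pfx"] + b["wd_pfx"]} for b in bins.values()]
```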

Page 19: Higher Order Learning


Preliminary Results – BGP dataset

• Border Gateway Protocol (BGP) routing data
  – BGP messages generated during interdomain routing
  – Relevant to cybersecurity
• Detect abnormal BGP events
  – Internet worm attacks (Slammer, Witty, …), power failures, etc.
  – Data from a period of time surrounding/including worm propagation
  – Instance → three-second sample of BGP traffic
  – Six numeric attributes (Li et al., 2005)
• Previously, a decision tree was applied successfully for two classes: worm vs. normal (Li et al., 2005)
  – Cannot distinguish different worms!

Page 20: Higher Order Learning


Preliminary Results – BGP dataset

[Figure: two-tail p-values across 37 sliding windows for the Slammer, Witty, and Blackout events, each plotted against the 5% significance level]

Event 1  | Event 2  | t-test result
Slammer  | Witty    | 0.00023
Blackout | Witty    | 0.00016
Slammer  | Blackout | 0.018

• 240 instances to characterize a particular abnormal event
• Sliding window approach for detection (sketched below)
  – Window size: 120 instances (360 seconds)
  – Sliding by 10 instances (sampling every 30 seconds)

Ganiz, M., Pottenger, W.M., Kanitkar, S., Chuah, M.C. (2006b). Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis. Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06), December 2006, Hong Kong, China
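A sketch of the window arithmetic above (the frequency inputs and the t-test scoring are placeholders for the published method):

```python
from scipy import stats

def sliding_window_pvalues(event_freqs, reference_freqs,
                           window=120, step=10):
    # 120 instances = 360 seconds per window; step of 10 instances
    # = one comparison every 30 seconds, as on the slide.
    pvals = []
    for start in range(0, len(event_freqs) - window + 1, step):
        _, p = stats.ttest_ind(event_freqs[start:start + window],
                               reference_freqs)
        pvals.append(p)
    # Windows with p < 0.05 flag a statistically distinct (abnormal) event.
    return pvals
```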

Page 21: Higher Order Learning


Preliminary Results – Naïve Bayes on Higher-order Paths

• Cora (McCallum et al., 2000)
  – Scientific paper dataset
  – Several classes: case based, neural networks, etc.
  – 2708 documents, 1433 terms, 5429 links
  – Terms are ordered most sparse first
• Instead of links, we used higher order paths in a Naïve Bayes framework
• E.g., when 2nd-order paths are used, F-beta (beta = 1) is higher starting from dictionary size 400

[Figure: Cora dataset, macro-averaged F1 vs. dictionary size (200 to 1433) for the Fb1 and Fb2 series]
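One plausible reading of “higher order paths in a Naïve Bayes framework” (my interpretation, not necessarily the authors’ formulation): inject second-order co-occurrence mass into the term counts before fitting a standard multinomial Naïve Bayes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-term count matrix (3 docs x 4 terms) and class labels.
X_train = np.array([[2, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 2]])
y_train = np.array([0, 0, 1])

# First-order term-term co-occurrence learned from the training corpus.
co = ((X_train.T @ X_train) > 0).astype(int)
np.fill_diagonal(co, 0)

def add_second_order(X):
    # Terms reachable from a document's terms through one co-occurrence
    # edge contribute virtual counts: a rough stand-in for 2nd-order paths.
    return X + X @ co

clf = MultinomialNB().fit(add_second_order(X_train), y_train)
print(clf.predict(add_second_order(np.array([[1, 0, 0, 1]]))))
```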

Page 22: Higher Order Learning


What role do higher-order relations play in unsupervised machine learning?

• Next step? Consider unsupervised learning…
  – Association Rule Mining (ARM)
    • ARM is one of the most widely used algorithms in data mining
  – Extend ARM to higher order… Higher Order Apriori
• Experiments confirm the value of Higher Order Apriori on real-world e-marketplace data

Page 23: Higher Order Learning


Higher Order Apriori: Approach

• First we extend the itemset definition to incorporate k-itemsets up to nth order.
  – Definition 1: Items a and b are nth-order associated if a and b can be associated across n distinct records (sketched below).
  – Definition 2: An nth-order k-itemset is a k-itemset for which each pair of its items is nth-order associated.
  – Definition 3: Two records are nth-order linked if they can be linked through n-2 distinct records.
  – Definition 4: An nth-order itemset i1 i2 … in is supported by an nth-order recordset r1 r2 … rn if no two items come from the same record.

$i_1 \sim r_1 \sim i_2 \sim r_2 \sim \cdots \sim r_{n-1} \sim i_n$

$a \sim r_1 \sim i_1 \sim r_2 \sim i_2 \sim \cdots \sim i_{n-1} \sim r_n \sim b$
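A small sketch of Definitions 1 and 3 (the record encoding is mine): breadth-first search over records that share an item, counting how many records the association spans.

```python
from collections import deque

def association_order(records, a, b):
    # records: list of item sets. Returns the smallest n such that a and b
    # are nth-order associated (a chain of n distinct records, consecutive
    # records sharing an item), or None if no such chain exists.
    starts = [i for i, r in enumerate(records) if a in r]
    queue = deque((i, 1) for i in starts)
    seen = set(starts)
    while queue:
        i, n = queue.popleft()
        if b in records[i]:
            return n
        for j, r in enumerate(records):
            if j not in seen and records[i] & r:
                seen.add(j)
                queue.append((j, n + 1))
    return None

records = [{"a", "x"}, {"x", "y"}, {"y", "b"}]
print(association_order(records, "a", "b"))  # 3: a third-order association
```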

Page 24: Higher Order Learning


Higher Order Apriori: Approach

• Given j instances of an nth-order k-recordset rs, its size is defined as:

  $size(rs_{n\_k}) = \sum_{t=1}^{j} \sum_{v=1}^{k(k-1)/2} |I_v|$

• Since the same k-itemset can be generated at different orders, the global support for a given k-itemset must include the local support at each order u, giving:

  $\sup(is_k) = \sum_{u=1}^{max\_order} \log_{10}\big(size(rs_{u\_k}) + 1\big)$
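Under this reading of the formulas (reconstructed above from a garbled source), a sketch of the global support computation with hypothetical per-order sizes:

```python
import math

def global_support(sizes_by_order):
    # sizes_by_order[u]: size of the order-u recordsets generating the
    # k-itemset. The log damping keeps no single order from dominating.
    return sum(math.log10(s + 1) for s in sizes_by_order.values())

# Hypothetical k-itemset generated at orders 1, 2 and 4.
print(global_support({1: 9, 2: 99, 4: 0}))  # 1.0 + 2.0 + 0.0 = 3.0
```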

Page 25: Higher Order Learning


Higher Order Apriori: Approach

• Higher Order Apriori is structured in a level-wise, order-first manner (see the sketch below).
  – Level-wise: the size of the k-itemsets increases in each iteration (as is the case for Apriori)
  – Order-first: at each level, itemsets are generated across all orders
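A structural sketch of this control flow (the candidate-generation and support routines are placeholders, not the published algorithm):

```python
def higher_order_apriori(records, max_k, max_order, min_support,
                         gen_candidates, support):
    # Level-wise: the itemset size k grows each iteration, as in Apriori.
    # Order-first: at each level, candidates are generated across all
    # orders before pruning, so every order competes at the same level.
    frequent, prev_level = {}, None
    for k in range(2, max_k + 1):
        candidates = set()
        for order in range(1, max_order + 1):        # order-first sweep
            candidates |= gen_candidates(records, k, order, prev_level)
        prev_level = {c for c in candidates
                      if support(records, c) >= min_support}
        if not prev_level:
            break
        frequent[k] = prev_level
    return frequent
```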

Page 26: Higher Order Learning


Higher Order Apriori: Results

• Our algorithm was tested on real-world e-commerce data from the KDD Cup 2000 competition. There are 530 transactions involving 244 products in the dataset.
• We compared the itemsets generated by Higher Order Apriori (limited to 6th order) with two other algorithms:
  – Apriori (1st order)
  – Indirect (our algorithm limited to 2nd order)
• We conducted experiments on multiple systems, including at the National Center for Supercomputing Applications (NCSA)

Page 27: Higher Order Learning


Higher Order Apriori: Results

• Higher Order Apriori mines significantly more final itemsets than Apriori and Indirect

[Figure: log(number of itemsets) vs. number of records (50, 75, 100, 200, 530) for Apriori, Indirect, and HOApriori; k = 2 for HOApriori and Indirect]

• Next we show that high-support itemsets are discovered using smaller datasets than Apriori or Indirect require

Page 28: Higher Order Learning


Higher Order Apriori: Results

• {CU, DQ} is the top-ranked 2-itemset using Apriori on all 530 transactions.
  – Neither Apriori nor Indirect leverages the latent higher-order information in runs of 75, 100, and 200 random transactions
  – Higher Order Apriori discovered this itemset as top-ranked using only 75 transactions
  – In addition, the gap between the supports increases as the transaction sets get larger

[Figures: ranking of {CU, DQ} vs. number of records for Indirect and HOApriori; support of {CU, DQ} vs. number of records for Apriori, Indirect, and HOApriori]

Page 29: Higher Order Learning


Higher Order Apriori: Results

• Discovering novel itemsets

                      | Apriori                | Indirect                          | Higher Order Apriori
Itemsets Discovered   | {AY, X} {X, K} {K, Q}  | {AY, K} {X, Q} + Apriori itemsets | {AY, Q} + Indirect itemsets + Apriori itemsets
Itemsets Undiscovered | {AY, K} {X, Q} {AY, Q} | {AY, Q}                           | (none)

• AY- Girdle-at-the-top Classic Sheer Pantyhose

• Q- Men’s City Rib Socks - 3 Pack

• Shaver : Women’s Pantyhose relationship

Apriori              | (Donna Karan’s Extra Thin Pantyhose, Wet/Dry Shaver)
Indirect             | (Berkshire’s Ultra Nudes Pantyhose, Epilady Wet/Dry Shaver)
Higher Order Apriori | (Donna Karan’s Pantyhose, Epilady Wet/Dry Shaver)

• This relationship is also discovered by Apriori and Indirect, but Higher Order Apriori discovered a new nugget, which provides extra evidence for the relationship

Page 30: Higher Order Learning


Higher Order Apriori: Results

• Discovering novel relationships
  – Higher Order Apriori discovers itemsets that demonstrate novel relationships not discovered by lower-order methods.
  – For example, the following are reasonable relationships. While Apriori and Indirect failed to discover itemsets representing them in the SIGKDD dataset, they might discover such links given a larger dataset.

Shaver : Lotion/Cream
  Apriori:              No
  Indirect:             No
  Higher Order Apriori: (Pedicure Care Kit, Toning Lotion), (Wet/Dry Shaver, Herb Lotion), (Pedicure Care Kit, Leg Cream)

Foot cream : Women’s socks
  Apriori:              No
  Indirect:             No
  Higher Order Apriori: (Foot Cream, Women’s Ultra Sheer Knee High), (Foot Cream, Women’s Cotton Dog Sock)

Page 31: Higher Order Learning


Conclusions

• Many traditional machine learning algorithms assume instances are independent and identically distributed (IID)
  – Apply the model to a single instance (decision is based on the feature vector) in a “context-free” manner
  – Independent of the other instances in the test set
• Statistical Relational Learning (SRL)
  – Classifies a set of instances simultaneously (collective classification)
  – Utilizes relations (links) between instances in the dataset
  – Usually considers immediate neighbors
  – Violates the “independence” assumption
• Our approach utilizes the latent information based on higher-order paths
  – Utilizes higher-order paths of order greater than or equal to two
  – Higher-order paths are implicit, based on co-occurrences of entities
  – We do not use the explicit links in the dataset!
  – Captures “latent semantics” (as in Latent Semantic Indexing)

Page 32: Higher Order Learning


Thanks

Q&A