Mining Associations Using Directed Hypergraphs
Ramanuja Simha (University of Delaware), Rahul Tripathi (Walmart, Information Services), Mayur Thakur (Google Inc.)
ICDE 2012
The support and confidence measures are generalized for multi-valued attributes in a database $D(A, O, V)$ as follows:
1. Let $X = \{(A_{i_1}, v_{j_1}), (A_{i_2}, v_{j_2}), \ldots, (A_{i_r}, v_{j_r})\}$ be any subset of $A \times V$. The support of $X$, denoted by $\mathrm{Supp}(X)$, is defined as the fraction of observations in $D$ for which $A_{i_1}$ takes value $v_{j_1}$, $A_{i_2}$ takes value $v_{j_2}$, ..., and $A_{i_r}$ takes value $v_{j_r}$.
2. Let $X \overset{mva}{\Longrightarrow} Y$ be an mva-type association rule. Then the confidence of this rule, denoted by $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y)$, is defined as follows:
$$\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}.$$
Consider the mva-type association rule $X \overset{mva}{\Longrightarrow} Y$, where $X = \{(G_2, \downarrow), (G_3, \downarrow)\}$ and $Y = \{(G_4, \uparrow)\}$.
This rule means: “If gene 2 and gene 3 in a patient are underexpressed (low), then it is likely that gene 4 is overexpressed (high) in the patient.”
$\mathrm{Supp}(X)$ is the fraction of observations where $G_2 = \downarrow$ and $G_3 = \downarrow$, i.e., $\mathrm{Supp}(X) = 7/8 = 0.875$.
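The two measures can be sketched in a few lines of Python. This is only an illustration: the eight observations below are made up (they are not the table from the slides), chosen so that $\mathrm{Supp}(X) = 7/8$ as in the example, and the names `DB`, `supp`, and `conf` are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy database: one dict per observation (attribute -> value).
# The rows are invented for illustration; they are not the slide's data.
DB = (
    [{"G2": "down", "G3": "down", "G4": "up"}] * 6
    + [{"G2": "down", "G3": "down", "G4": "down"}]
    + [{"G2": "up", "G3": "down", "G4": "up"}]
)


def supp(db, items):
    """Fraction of observations in which every (attribute, value) pair holds."""
    hits = sum(all(row[a] == v for a, v in items) for row in db)
    return Fraction(hits, len(db))


def conf(db, x, y):
    """Conf(X => Y) = Supp(X u Y) / Supp(X); defined as 0 if X never occurs."""
    sx = supp(db, x)
    return supp(db, set(x) | set(y)) / sx if sx else Fraction(0)


X = {("G2", "down"), ("G3", "down")}
Y = {("G4", "up")}
print(supp(DB, X))     # support of X on the toy data
print(conf(DB, X, Y))  # confidence of X => Y on the toy data
```

Exact fractions are used so that the ratio of two supports is not blurred by floating-point rounding.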
An association hypergraph $H$ for a database $D$ is a directed hypergraph in which vertices are attributes of $D$ and directed hyperedges connect one subset of vertices to another disjoint subset.
Each directed hyperedge $e = (T, H)$, say, $T = \{A_1, A_2\}$ and $H = \{A_3\}$, has an association confidence value $\mathrm{ACV}(e)$ in the range $[0, 1]$.
Definition (Association confidence)
The association confidence value of a directed hyperedge $(\{A_1, A_2\}, \{A_3\})$ equals
$$\sum_{v_1, v_2} \mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\}) \times \mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(A_3, v_3^*)\}).$$
Fix a combination of two or fewer attributes, say $\{A_1, A_2\}$, and any other attribute, say $A_3$.
Determine whether $(\{A_1, A_2\}, \{A_3\})$ could be included as a directed hyperedge of $H$.
Include the directed hyperedge if it is $\gamma$-significant.
Definition ($\gamma$-significance)
Consider a combination $(T, H)$ for inclusion as a directed hyperedge of the association hypergraph $H$, where $|T| \geq 1$. For $\gamma \geq 1$, we say that $(T, H)$ is $\gamma$-significant if
$$\mathrm{ACV}(T, H) \geq \gamma \cdot \max_{v \in T} \{\mathrm{ACV}(T - \{v\}, H)\}.$$
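The test can be sketched as a lookup over precomputed ACV values. The ACV numbers and the name `is_gamma_significant` are hypothetical. As an assumption of this sketch, a sub-tail with no stored ACV (for example the empty tail when $|T| = 1$) is treated as having ACV 0, which the slides do not specify.

```python
def is_gamma_significant(acv, tail, head, gamma):
    """Check ACV(T, H) >= gamma * max over v in T of ACV(T - {v}, H).

    acv maps (frozenset of tail attributes, head attribute) to an ACV value.
    Assumption: a sub-tail with no stored ACV counts as 0.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    tail = frozenset(tail)
    baseline = max(acv.get((tail - {v}, head), 0.0) for v in tail)
    return acv[(tail, head)] >= gamma * baseline


# Hypothetical ACV values, for illustration only.
ACV = {
    (frozenset({"A1", "A2"}), "A3"): 0.60,
    (frozenset({"A1"}), "A3"): 0.50,
    (frozenset({"A2"}), "A3"): 0.40,
}

print(is_gamma_significant(ACV, {"A1", "A2"}, "A3", gamma=1.1))
print(is_gamma_significant(ACV, {"A1", "A2"}, "A3", gamma=1.25))
```

A larger $\gamma$ keeps a hyperedge only when the combined tail improves more strongly on its best sub-tail.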
Out-similarity of $A_1$ and $A_2$, denoted by $\operatorname{out-sim}_H(A_1, A_2)$, is the weighted fraction of directed hyperedge pairs $(e, f)$, where $e \in \mathrm{out}_H(A_1)$ and $f \in \mathrm{out}_H(A_2)$, such that switching $A_1$ to $A_2$ in the tail set of $e$ results in $f$.
In-similarity of $A_1$ and $A_2$, denoted by $\operatorname{in-sim}_H(A_1, A_2)$, is the weighted fraction of directed hyperedge pairs $(e, f)$, where $e \in \mathrm{in}_H(A_1)$ and $f \in \mathrm{in}_H(A_2)$, such that switching $A_1$ to $A_2$ in the head set of $e$ results in $f$.
Here, for any $A \in V$, $\mathrm{out}_H(A)$ denotes the set of all directed hyperedges of $H$ whose tail set contains $A$; $\mathrm{in}_H(A)$ is defined analogously for head sets.
Construct a similarity graph in time $O(m^2)$, where $m$ is the number of directed hyperedges in $H$.
Use the t-clustering algorithm by Gonzalez [Go85] to find a clustering of attributes in time $O(|t| \cdot |S|)$.
Definition (Similarity graphs)
Let $H = (V, E)$ be an association hypergraph. Given any collection $S$ of attributes, a similarity graph $SG_S = (V', E')$ induced by $S$ in $H$ is an undirected, weighted, complete graph whose node set $V'$ is $S$ and edge set $E'$ contains all attribute pairs in $S$ such that, for every edge $\{A_1, A_2\} \in E'$, its weight $d(A_1, A_2)$ is defined as
$$d(A_1, A_2) = 1 - \frac{\operatorname{weighted-in-sim}_H(A_1, A_2) + \operatorname{weighted-out-sim}_H(A_1, A_2)}{2}.$$
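The clustering step can be illustrated with the farthest-first traversal that underlies Gonzalez's algorithm. This is a sketch under assumptions: the similarity values are taken as given, the distances in the example are invented, and the names `similarity_distance` and `gonzalez_clusters` are hypothetical.

```python
def similarity_distance(in_sim, out_sim):
    """Edge weight d = 1 - (weighted-in-sim + weighted-out-sim) / 2."""
    return 1 - (in_sim + out_sim) / 2


def gonzalez_clusters(nodes, dist, t):
    """Farthest-first traversal: pick up to t centers, then assign each node
    to its nearest center. Uses O(t * len(nodes)) distance evaluations."""
    centers = [nodes[0]]                              # arbitrary first center
    nearest = {u: dist(u, centers[0]) for u in nodes}
    while len(centers) < min(t, len(nodes)):
        far = max(nodes, key=lambda u: nearest[u])    # node farthest from all centers
        centers.append(far)
        for u in nodes:
            nearest[u] = min(nearest[u], dist(u, far))
    clusters = {c: [] for c in centers}
    for u in nodes:
        clusters[min(centers, key=lambda c: dist(u, c))].append(u)
    return clusters


# Hypothetical distances: A is close to B, C is close to D, all else far apart.
W = {frozenset("AB"): 0.1, frozenset("CD"): 0.2}


def dist(u, v):
    return 0.0 if u == v else W.get(frozenset((u, v)), 0.9)


print(gonzalez_clusters(["A", "B", "C", "D"], dist, t=2))
```

The first center is arbitrary; each later center is the node currently farthest from every chosen center, which is what keeps the largest cluster diameter small.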
A leading indicator $X$ for a set $S$ of attributes is a subset of $S$ such that knowing values for the attributes in $X$ allows us to infer the values for all attributes in $S - X$.
Definition (Leading indicators)
A dominator for a set $S$ of vertices in an association hypergraph $H = (V, E)$ is a set $X \subseteq V$ such that, for every $u \in S - X$, there is a directed hyperedge $e = (T, H) \in E$ such that $T \subseteq X$ and $u \in H$. That is, each node $u \in S - X$ is covered using only directed hyperedges whose tail set is from the set $X$.
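The dominator condition translates directly into a membership check. In this sketch, hyperedges are represented as (tail set, head set) pairs; the representation, the example edges, and the name `is_dominator` are hypothetical.

```python
def is_dominator(x, s, edges):
    """True if every node of S - X lies in the head of some hyperedge whose
    tail set is contained in X. edges is an iterable of (tail, head) set pairs."""
    x, s = set(x), set(s)
    covered = set()
    for tail, head in edges:
        if set(tail) <= x:          # the hyperedge may fire using only nodes of X
            covered |= set(head)
    return (s - x) <= covered


# Hypothetical hyperedges: {A, B} -> {C} and {A} -> {D}.
EDGES = [({"A", "B"}, {"C"}), ({"A"}, {"D"})]
S = {"A", "B", "C", "D"}

print(is_dominator({"A", "B"}, S, EDGES))
print(is_dominator({"A"}, S, EDGES))
```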
The greedy approach is: for every node $u$ that is not yet part of the dominating set, the algorithm computes the node effectiveness $\alpha(u)$, which reflects $u$'s covering ability.
The node with the highest effectiveness value is added to the dominator set.
The greedy algorithm runs in time $O(|S| \cdot |E|)$.
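One possible reading of the greedy step is sketched below. The slides do not define $\alpha(u)$ here, so this sketch assumes $\alpha(u)$ is the number of nodes of $S$ that stop being uncovered when $u$ is added to the dominator. It is a naive version that recomputes coverage from scratch, so it does not attain the stated $O(|S| \cdot |E|)$ bound. The names `covered_by` and `greedy_dominator` are hypothetical.

```python
def covered_by(x, edges):
    """Nodes in the head of some hyperedge whose tail set lies inside x."""
    out = set()
    for tail, head in edges:
        if set(tail) <= x:
            out |= set(head)
    return out


def greedy_dominator(s, edges):
    """Greedy sketch. Assumption: alpha(u) is the drop in the number of
    uncovered nodes of S caused by adding u to the dominator."""
    s = set(s)
    x = set()

    def uncovered(xs):
        return (s - xs) - covered_by(xs, edges)

    while uncovered(x):
        left = len(uncovered(x))
        # sorted() makes tie-breaking deterministic (first maximum wins)
        best = max(sorted(s - x), key=lambda u: left - len(uncovered(x | {u})))
        x.add(best)
    return x


# Hypothetical hyperedges: {A} -> {B}, {A} -> {C}, {A, B} -> {D}.
EDGES = [({"A"}, {"B"}), ({"A"}, {"C"}), ({"A", "B"}, {"D"})]
print(sorted(greedy_dominator({"A", "B", "C", "D"}, EDGES)))
```

The loop always terminates: adding any currently uncovered node reduces the uncovered count by at least one, so some candidate has positive effectiveness at every step.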
We compute the value for any attribute $Y \in T$ given values $v_1, v_2, \ldots, v_t \in V$ of a set $S$ of the attributes.
Examine the association table (AT) of directed hyperedges of the form $e = (\{A_1, A_2\}, \{Y\})$ whose tail sets are subsets of $S$.
Use $AT(e)$ to find $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\})$ and $\mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$.
Find the contribution of $e$ to the value assignment $y$ of $Y$ by computing $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\}) \times \mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$.
The total contribution over all directed hyperedges to the value assignment $y$ of $Y$ is denoted by $val[y]$.
Choose the value $y^*$ of $Y$ for which $val[y^*]$ is maximum.
The algorithm runs in time $O(|V|^2 \cdot |T| \cdot |E|)$.
Input: An association hypergraph $H = (V, E)$ modeling attribute relationships, a set $T$ of attributes, and a set $S = \{(A_1, v_1), (A_2, v_2), \ldots, (A_t, v_t)\}$, where $A_1, A_2, \ldots, A_t$ are attributes and $v_1, v_2, \ldots, v_t \in V$ are their respective values.
Output: An assignment of values that assigns each attribute $Y \in T$ its best classified value $y^*$, and $val[y^*]$ associated with every such assignment $y^*$ to $Y$.
begin
    foreach attribute $Y \in T$ do
        for $y \leftarrow 1$ to $k$ do
            $val[y] \leftarrow 0$;
        end
        foreach directed hyperedge $e = (T, H) \in E$ with $H = \{Y\}$ and $T \subseteq \{A_1, A_2, \ldots, A_t\}$ do
            Let $T$ be $\{A_1, A_2\}$ and let $y$ be the most frequent value of $Y$ given “$A_1 = v_1$” and “$A_2 = v_2$”;
            $val[y] \leftarrow val[y] + \mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\}) \times \mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$;
        end
        Let $y^* \in \{1, \ldots, k\}$ be such that $val[y^*] = \max_{y \in \{1, \ldots, k\}} val[y]$;
        $val[y^*] \leftarrow val[y^*] / \sum_{y \in \{1, \ldots, k\}} val[y]$;
        Output “$(Y, y^*, val[y^*])$”;
    end
end
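The pseudocode can be turned into a short runnable sketch. Two assumptions are made: supports and confidences are computed directly from training rows instead of from precomputed association tables, and hyperedges are given as (tail attributes, head attribute) pairs. The name `classify` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def classify(db, edges, known, targets):
    """For each target Y, sum Supp * Conf over applicable hyperedges and
    return (y*, normalized val[y*]), or None if no hyperedge applies."""
    n = len(db)
    result = {}
    for y_attr in targets:
        val = defaultdict(float)
        for tail, head in edges:
            if head != y_attr or not set(tail) <= set(known):
                continue
            counts = Counter(row[head] for row in db
                             if all(row[a] == known[a] for a in tail))
            if not counts:
                continue  # these tail values never occur together in training
            y, hits = counts.most_common(1)[0]   # most frequent value of Y
            # Supp(tail values) * Conf(tail values => (Y, y)) simplifies to hits / n
            val[y] += hits / n
        if val:
            best = max(val, key=val.get)
            result[y_attr] = (best, val[best] / sum(val.values()))
        else:
            result[y_attr] = None
    return result


# Hypothetical training rows (A1, A2, Y) and hyperedges.
ROWS = [("up", "up", "up"), ("up", "up", "up"), ("up", "up", "down"),
        ("up", "down", "up"), ("down", "up", "down"), ("down", "up", "down"),
        ("down", "up", "down"), ("down", "down", "down")]
DB = [dict(zip(("A1", "A2", "Y"), r)) for r in ROWS]
EDGES = [(("A1", "A2"), "Y"), (("A1",), "Y"), (("A2",), "Y")]

print(classify(DB, EDGES, {"A1": "up", "A2": "up"}, ["Y"]))
```

On this toy data two hyperedges vote for "up" and one for "down", so the normalized score reflects how much of the total contribution backs the chosen value.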
Experimental Data
Experimental analysis is based on financial time-series data from January 1, 1995 to December 31, 2009, since a number of companies in S&P 500 started trading in the mid 90s. The number of financial time-series in our analysis is 346.
Financial time-series data belongs to industrial sectors such as Basic Materials (BM), Capital Goods (CG), Conglomerates (C), Consumer Cyclical (CC), Consumer Noncyclical (CN), Energy (E), Financial (F), Healthcare (H), Services (SV), Technology (T), Transportation (TP), and Utilities (U).
Modeling Database as Association Hypergraphs
Association Hypergraph Construction:
Convert each financial time-series into a delta time-series.
In the delta time-series, the $i$'th entry is the fractional change in the closing stock price of the $(i + 1)$'th day relative to the closing stock price of the $i$'th day.
Discretize the delta time-series values.
For discretization, we use two configurations.
C1 shows results for $|V| = 3$ and C2 shows results for $|V| = 5$.
C1 leads to 106,475 directed edges and 157,412 2-to-1 directed hyperedges.
C2 leads to 109,810 directed edges and 274,048 2-to-1 directed hyperedges.
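The preprocessing can be sketched as follows. The delta computation follows the definition above. The discretization rule is an assumption: the slides do not state the cut points, so this sketch uses a symmetric threshold to produce three values, in the spirit of configuration C1. The names `delta_series` and `discretize`, the threshold, and the prices are hypothetical.

```python
def delta_series(closes):
    """Entry i is the fractional change of close[i + 1] relative to close[i]."""
    return [(closes[i + 1] - closes[i]) / closes[i] for i in range(len(closes) - 1)]


def discretize(deltas, threshold=0.01):
    """Map each delta to one of three values.

    Assumption: a symmetric threshold; the actual cut points used in the
    experiments are not given in the slides.
    """
    labels = []
    for d in deltas:
        if d > threshold:
            labels.append("up")
        elif d < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


# Hypothetical closing prices for four consecutive trading days.
closes = [100.0, 102.0, 102.0, 96.9]
deltas = delta_series(closes)
print(deltas)
print(discretize(deltas))
```

A series of $n$ closing prices yields $n - 1$ deltas, so each discretized series is one entry shorter than the raw one.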
Highest ACV Directed Edges and Directed Hyperedges
| Row | Time-series | Configuration | Top directed edge | Top 2-to-1 directed hyperedge |
|---|---|---|---|---|
| 11 | TE (U) | C1 | PGN (U) → TE (U) | PEG (U), SO (U) → TE (U) |
|    |        | C2 | AEP (U) → TE (U) | SO (U), TEG (U) → TE (U) |
Best prediction directed edge for GT (The Goodyear Tire & Rubber Company): PPG (BM) → GT (CC).
Interpreted in terms of GT procuring raw materials (e.g., precipitated silicas) from PPG for the manufacturing or processing of rubber [PP10].
Best prediction 2-to-1 directed hyperedge for GT: DOW (BM), F (CC) → GT (CC).
Interpreted in terms of GT procuring raw materials (e.g., polyurethane polymer) from DOW [HR01, Do10], whereas the relationship with F may be attributed to F utilizing the products (e.g., tires) from GT [Wa01].
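| Row | Time-series | Configuration | 2-to-1 directed hyperedge | Directed edge | Directed edge |
|---|---|---|---|---|---|
| 11 | TE (U) | C1 | PEG, SO → TE (0.55) | PEG → TE (0.52) | SO → TE (0.52) |
|    |        | C2 | SO, TEG → TE (0.4) | SO → TE (0.35) | TEG → TE (0.35) |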
Comparison with Euclidean Similarity (C1)
Clusters of Financial Time-Series (C1)
Association-Based Classifier
Methodology:
Construct an association hypergraph using the training data set.
Discretize the test data set.
Choose a subset $S$ (dominator) and fix the value of every financial time-series in $S$ to its value in the test data set.
Use the association-based classifier to obtain a prediction for all financial time-series that are not part of the dominator.
Classification confidence for any financial time-series $A$ is the fraction of days the predicted value of $A$ matches the value in the discretized representation of $A$ in the test data set.
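The classification confidence defined above is a per-series match rate, which can be sketched as follows. The function name and the example label sequences are hypothetical.

```python
def classification_confidence(predicted, actual):
    """Fraction of days on which the predicted discretized value equals the
    value in the discretized test series."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predictions and discretized test values for four days.
predicted = ["up", "down", "flat", "up"]
actual = ["up", "down", "up", "up"]
print(classification_confidence(predicted, actual))
```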
Leading Indicators of Financial Time-Series and Association-Based Classifier: Experimental Results
Classification Confidence Distribution
Conclusions and Future Work
We proposed a directed-hypergraph-based model to capture attribute-level associations for any database, and used it to address problems such as similarity, clustering, leading indicators, and classification.
We tested the model on a financial time-series data set (S&P 500) and demonstrated its consistency through the experimental results.
Future work includes:
Understanding how the different parameters ($|V|$, $\gamma$, and the sizes of head and tail sets) affect the model.
Exploring associations in various domains by applying the directed hypergraph model to data sets such as gene databases, social network data sets, and medical databases.
References
[BZ07] B. Bringmann and A. Zimmermann. The Chosen Few: On Identifying Valuable Patterns. ICDM 07, 63-72.
[Do10] The Dow Chemical Company. Dow Chemical Products. http://www.dow.com/products services/division/auto.htm.
[Go85] T. Gonzalez. Clustering to Minimize the Maximum Intercluster Distance. Theoretical Computer Science 38:293-306.
[HK98] E. Han and G. Karypis and V. Kumar and B. Mobasher. Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results. IEEE Data Eng. Bull. 21(1):15-22.
[HR01] Highbeam Research. Goodyear revisits PU tyre. http://www.highbeam.com/doc/1G1-81891336.html.
[KH06] A. Knobbe and E. Ho. Maximally Informative k-Itemsets and Their Efficient Discovery. KDD 06, 237-244.
[LS97] B. Lent and A. Swami and J. Widom. Clustering Association Rules. ICDE 97, 220-231.
[OA04] M. Ozdal and C. Aykanat. Hypergraph Models and Algorithms for Data-Pattern-Based Clustering. Data Mining and Knowledge Discovery 9(1):29-57.