Faloutsos 15-826
CMU SCS
15-826: Multimedia Databases and Data Mining
Lecture #21: Tensor decompositions C. Faloutsos
Must-read Material
• Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories, Albuquerque, NM and Livermore, CA, November 2007.
Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining
• CANDECOMP = Canonical Decomposition (Carroll & Chang, 1970)
• PARAFAC = Parallel Factors (Harshman, 1970)
• Core is diagonal (specified by the vector λ)
• Columns of A, B, and C are not orthonormal
• If R is minimal, then R is called the rank of the tensor (Kruskal 1977)
• Can have rank(X) > min{I,J,K}, where X is the I x J x K tensor
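[Figure: the I x J x K tensor is approximated by factor matrices A (I x R), B (J x R), and C (K x R) around a diagonal R x R x R core, which is the same as a sum (+…+) of R rank-1 terms.]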
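The PARAFAC model described above can be written out entry by entry. The following is a minimal NumPy sketch, not code from the lecture; the function name `cp_to_tensor` and the toy sizes are assumptions chosen for illustration. It rebuilds a tensor from λ, A, B, and C in two ways: through the diagonal core, and as an explicit sum of R rank-1 terms.

```python
import numpy as np

def cp_to_tensor(lam, A, B, C):
    """Rebuild X[i, j, k] = sum_r lam[r] * A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)

# Toy sizes (assumed for illustration): I=4, J=3, K=2, R=2.
rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2
lam = np.array([2.0, 0.5])
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

X = cp_to_tensor(lam, A, B, C)

# The same tensor as an explicit sum of R rank-1 (outer-product) terms.
X_sum = sum(
    lam[r] * np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
print(X.shape, np.allclose(X, X_sum))
```

Both constructions agree because a diagonal core only ever pairs column r of A with column r of B and column r of C.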
Tucker vs. PARAFAC Decompositions
• Tucker
– Variable transformation in each mode
– Core G may be dense
– A, B, C generally orthonormal
– Not unique
• PARAFAC
– Sum of rank-1 components
– No core, i.e., superdiagonal core
– A, B, C may have linearly dependent columns
– Generally unique
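[Figure: Tucker: the I x J x K tensor ~ a core of size R x S x T multiplied by A (I x R), B (J x S), and C (K x T). PARAFAC: the I x J x K tensor ~ a sum (+…+) of R rank-1 terms built from (a1, b1, c1), …, (aR, bR, cR).]
IMPORTANT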
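To make the Tucker side of the comparison concrete, here is a small sketch that computes a Tucker decomposition with the higher-order SVD (HOSVD). HOSVD is one standard way to obtain orthonormal A, B, C and a dense core; it is an assumption of this sketch, not necessarily the method used in the lecture. The helper names (`unfold`, `hosvd`, `tucker_to_tensor`) are made up for illustration. The last lines illustrate the "not unique" bullet: rotating A and counter-rotating the core gives the same tensor.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that the chosen mode indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    """Tucker via HOSVD: leading left singular vectors of each unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    A, B, C = factors
    # Core G = X multiplied by A^T, B^T, C^T along modes 1, 2, 3.
    G = np.einsum("ijk,ir,js,kt->rst", X, A, B, C)
    return G, A, B, C

def tucker_to_tensor(G, A, B, C):
    return np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3, 2))
G, A, B, C = hosvd(X, ranks=(4, 3, 2))   # full ranks: exact reconstruction
X_hat = tucker_to_tensor(G, A, B, C)

# Non-uniqueness: rotate A by an orthogonal Q and counter-rotate the core.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A_rot = A @ Q
G_rot = np.einsum("rst,rp->pst", G, Q)
X_rot = tucker_to_tensor(G_rot, A_rot, B, C)
print(np.allclose(X, X_hat), np.allclose(X, X_rot))
```

The core G here is generally dense, in contrast to the superdiagonal core of PARAFAC.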
Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC the three groups are identical
• To solve: Alternating Least Squares
• Toolbox: from Tamara Kolda: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
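As an illustration of the Alternating Least Squares idea, here is a minimal NumPy sketch of ALS for a 3-mode PARAFAC model. It is not the Tensor Toolbox implementation; the function names, the random initialization, the fixed iteration count, and the toy tensor are all assumptions of this sketch. Each step fixes two factor matrices and solves an ordinary least-squares problem for the third.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that the chosen mode indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(U, V):
    """Column-wise Kronecker product; rows ordered with V's index fastest."""
    return np.einsum("ir,jr->ijr", U, V).reshape(U.shape[0] * V.shape[0], -1)

def cp_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Gram matrix of kr, computed cheaply as a Hadamard product.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
    return factors

def rel_error(X, A, B, C):
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Toy data (assumed): an exactly rank-2 tensor of size 5 x 4 x 3.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.uniform(size=(n, 2)) for n in (5, 4, 3))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C = cp_als(X, rank=2)
print("relative error:", rel_error(X, A, B, C))
```

Each update is an exact least-squares solve, so the fitting error cannot increase from one step to the next; convergence to the global optimum is not guaranteed, which is why practical code usually adds a stopping criterion and several random restarts.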
Outline
• Motivation - Definitions
• Tensor tools
• Case studies
– P1: web graph mining (‘TOPHITS’)
– P2: phone-call patterns
– P3: N.E.L.L. (never ending language learner)
P1: Web graph mining
• How to order the importance of web pages?
– Kleinberg’s algorithm HITS
– PageRank
– Tensor extension on HITS (TOPHITS)
P1: Web graph mining
• T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. ICDM 2005, pp. 242-249, November 2005, doi:10.1109/ICDM.2005.77.
Kleinberg’s Hubs and Authorities (the HITS method)
Sparse adjacency matrix and its SVD:
[Figure: the from x to adjacency matrix written as a sum of rank-1 terms; the 1st term pairs the hub scores and authority scores for the 1st topic, the 2nd term those for the 2nd topic.]
Kleinberg, JACM, 1999
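The link between HITS and the SVD can be checked on a toy graph. This is a minimal sketch with a made-up 5-page adjacency matrix (rows = "from", columns = "to"); the function name `hits_power_iteration` is an assumption. The leading left and right singular vectors give the hub and authority scores for the 1st topic, and the classic HITS iteration converges to the same vectors.

```python
import numpy as np

# Made-up graph: adj[i, j] = 1 if page i links to page j.
# Pages 0, 1, 2 act as hubs; pages 3 and 4 act as authorities.
adj = np.zeros((5, 5))
for src, dst in [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3)]:
    adj[src, dst] = 1.0

# SVD view: 1st left/right singular vectors = hub/authority scores.
U, s, Vt = np.linalg.svd(adj)
hubs_svd = np.abs(U[:, 0])   # abs() removes the arbitrary sign of the SVD
auth_svd = np.abs(Vt[0])

def hits_power_iteration(adj, n_iter=50):
    """Classic HITS: alternate hub and authority updates, then normalize."""
    auth = np.ones(adj.shape[1])
    for _ in range(n_iter):
        hubs = adj @ auth      # good hubs point to good authorities
        auth = adj.T @ hubs    # good authorities are pointed to by good hubs
        hubs /= np.linalg.norm(hubs)
        auth /= np.linalg.norm(auth)
    return hubs, auth

hubs_pi, auth_pi = hits_power_iteration(adj)
print("top authority:", int(np.argmax(auth_svd)))
print("methods agree:", np.allclose(hubs_pi, hubs_svd), np.allclose(auth_pi, auth_svd))
```

The 2nd singular vectors, `U[:, 1]` and `Vt[1]`, play the role of the hub and authority scores for the 2nd topic.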
HITS Authorities on Sample Data
We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.

1st Principal Factor
.97 www.ibm.com
.24 www.alphaworks.ibm.com
.08 www-128.ibm.com
.05 www.developer.ibm.com
.02 www.research.ibm.com
.01 www.redbooks.ibm.com
.01 news.com.com

2nd Principal Factor
.99 www.lehigh.edu
.11 www2.lehigh.edu
.06 www.lehighalumni.com
.06 www.lehighsports.com
.02 www.bethlehem-pa.gov
.02 www.adobe.com
.02 lewisweb.cc.lehigh.edu
.02 www.leo.lehigh.edu
.02 www.distance.lehigh.edu
.02 fp1.cc.lehigh.edu

3rd Principal Factor
.75 java.sun.com
.38 www.sun.com
.36 developers.sun.com
.24 see.sun.com
.16 www.samag.com
.13 docs.sun.com
.12 blogs.sun.com
.08 sunsolve.sun.com
.08 www.sun-catalogue.com
.08 news.com.com

4th Principal Factor
.60 www.pueblo.gsa.gov
.45 www.whitehouse.gov
.35 www.irs.gov
.31 travel.state.gov
.22 www.gsa.gov
.20 www.ssa.gov
.16 www.census.gov
.14 www.govbenefits.gov
.13 www.kids.gov
.13 www.usdoj.gov

6th Principal Factor
.97 mathpost.asu.edu
.18 math.la.asu.edu
.17 www.asu.edu
.04 www.act.org
.03 www.eas.asu.edu
.02 archives.math.utk.edu
.02 www.geom.uiuc.edu
.02 www.fulton.asu.edu
.02 www.amstat.org
.02 www.maa.org
Three-Dimensional View of the Web
Observe that this tensor is very sparse!
Kolda, Bader, Kenny, ICDM05
Topical HITS (TOPHITS)
Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.
[Figure: the from x to x term tensor written as a sum of rank-1 terms; each topic (1st, 2nd, …) has hub scores, authority scores, and term scores.]
TOPHITS Terms & Authorities on Sample Data
(each row: term score and term, then authority score and host)

1st Principal Factor
.23 JAVA .86 java.sun.com
.18 SUN .38 developers.sun.com
.17 PLATFORM .16 docs.sun.com
.16 SOLARIS .14 see.sun.com
.16 DEVELOPER .14 www.sun.com
.15 EDITION .09 www.samag.com
.15 DOWNLOAD .07 developer.sun.com
.14 INFO .06 sunsolve.sun.com
.12 SOFTWARE .05 access1.sun.com
.12 NO-READABLE-TEXT .05 iforce.sun.com

2nd Principal Factor
.20 NO-READABLE-TEXT .99 www.lehigh.edu
.16 FACULTY .06 www2.lehigh.edu
.16 SEARCH .03 www.lehighalumni.com
.16 NEWS
.16 LIBRARIES
.16 COMPUTING
.12 LEHIGH

3rd Principal Factor
.15 NO-READABLE-TEXT .97 www.ibm.com
.15 IBM .18 www.alphaworks.ibm.com
.12 SERVICES .07 www-128.ibm.com
.12 WEBSPHERE .05 www.developer.ibm.com
.12 WEB .02 www.redbooks.ibm.com
.11 DEVELOPERWORKS .01 www.research.ibm.com
.11 LINUX
.11 RESOURCES
.11 TECHNOLOGIES
.10 DOWNLOADS

4th Principal Factor
.26 INFORMATION .87 www.pueblo.gsa.gov
.24 FEDERAL .24 www.irs.gov
.23 CITIZEN .23 www.whitehouse.gov
.22 OTHER .19 travel.state.gov
.19 CENTER .18 www.gsa.gov
.19 LANGUAGES .09 www.consumer.gov
.15 U.S .09 www.kids.gov
.15 PUBLICATIONS .07 www.ssa.gov
.14 CONSUMER .05 www.forms.gov
.13 FREE .04 www.govbenefits.gov

6th Principal Factor
.26 PRESIDENT .87 www.whitehouse.gov
.25 NO-READABLE-TEXT .18 www.irs.gov
.25 BUSH .16 travel.state.gov
.25 WELCOME .10 www.gsa.gov
.17 WHITE .08 www.ssa.gov
.16 U.S .05 www.govbenefits.gov
.15 HOUSE .04 www.census.gov
.13 BUDGET .04 www.usdoj.gov
.13 PRESIDENTS .04 www.kids.gov
.11 OFFICE .02 www.forms.gov

12th Principal Factor
.75 OPTIMIZATION .35 www.palisade.com
.58 SOFTWARE .35 www.solver.com
.08 DECISION .33 plato.la.asu.edu
.07 NEOS .29 www.mat.univie.ac.at
.06 TREE .28 www.ilog.com
.05 GUIDE .26 www.dashoptimization.com
.05 SEARCH .26 www.grabitech.com
.05 ENGINE .25 www-fp.mcs.anl.gov
.05 CONTROL .22 www.spyderopts.com
.05 ILOG .17 www.mosek.com

13th Principal Factor
.46 ADOBE .99 www.adobe.com
.45 READER
.45 ACROBAT
.30 FREE
.30 NO-READABLE-TEXT
.29 HERE
.29 COPY
.05 DOWNLOAD

16th Principal Factor
.50 WEATHER .81 www.weather.gov
.24 OFFICE .41 www.spc.noaa.gov
.23 CENTER .30 lwf.ncdc.noaa.gov
.19 NO-READABLE-TEXT .15 www.cpc.ncep.noaa.gov
.17 ORGANIZATION .14 www.nhc.noaa.gov
.15 NWS .09 www.prh.noaa.gov
.15 SEVERE .07 aviationweather.gov
.15 FIRE .06 www.nohrsc.nws.gov
.15 POLICY .06 www.srh.noaa.gov
.14 CLIMATE

19th Principal Factor
.22 TAX .73 www.irs.gov
.17 TAXES .43 travel.state.gov
.15 CHILD .22 www.ssa.gov
.15 RETIREMENT .08 www.govbenefits.gov
.14 BENEFITS .06 www.usdoj.gov
.14 STATE .03 www.census.gov
.14 INCOME .03 www.usmint.gov
.13 SERVICE .02 www.nws.noaa.gov
.13 REVENUE .02 www.gsa.gov
.12 CREDIT .01 www.annualcreditreport.com
TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.
[Figure: Tensor PARAFAC: the from x to x term tensor is decomposed into hub scores, authority scores, and term scores for the 1st topic, the 2nd topic, and so on.]
wk = # unique links using term k
P2: Anomaly detection in time-evolving graphs
• Anomalous communities in phone call data:
– European country, 4M clients, data over 2 weeks
– Example community: 1 caller, 5 receivers, 4 days of activity
– ~200 calls to EACH receiver on EACH day!
Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos Papalexakis, Danai Koutra. Com2: Fast Automatic Discovery of Temporal (Comet) Communities. PAKDD 2014, Tainan, Taiwan.
GigaTensor: Scaling Tensor Analysis Up By 100 Times - Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
KDD 2012
P3: N.E.L.L. analysis
• NELL: Never Ending Language Learner
– Q1: dominant concepts / topics?
– Q2: synonyms for a given new phrase?
• Example facts: “Barack Obama is the president of U.S.”, “Eric Clapton plays guitar”
[Figure: the NELL (Never Ending Language Learner) tensor, with modes of size 26M, 48M, and 26M; nonzeros = 144M.]
A1: Concept Discovery
• Concept Discovery in Knowledge Base
A2: Synonym Discovery
• Synonym Discovery in Knowledge Base
[Figure: the columns a1, a2, …, aR of a factor matrix, with rows marked for the (given) subject, (discovered) synonym 1, and (discovered) synonym 2.]
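One plausible way to read the synonym-discovery picture is that phrases whose rows in the factor matrix point in similar directions are reported as synonyms. The sketch below assumes cosine similarity between rows as the similarity measure; that choice, the function name `top_synonyms`, the placeholder phrase names, and the toy factor matrix are all assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def top_synonyms(A, names, query, k=2):
    """Return the k phrases whose rows of A are most cosine-similar to query's row.

    Assumes every row of A is nonzero.
    """
    idx = names.index(query)
    unit = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalize each row
    sims = unit @ unit[idx]                              # cosine similarities
    order = [i for i in np.argsort(-sims) if i != idx][:k]
    return [(names[i], float(sims[i])) for i in order]

# Hypothetical factor matrix with R = 2 columns (a1, a2); values are made up.
names = ["phrase_a", "phrase_b", "phrase_c", "phrase_d"]
A = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.85, 0.05],
])

result = top_synonyms(A, names, "phrase_a")
print(result)
```

Here `phrase_d` and `phrase_b` load on the same component as `phrase_a`, so they rank ahead of `phrase_c`.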
Conclusions
• Real data are often in high dimensions with multiple aspects (modes)
• Matrices and tensors provide elegant theory and algorithms
[Figure: an I x J x K tensor ~ a sum (+…+) of R rank-1 terms built from (a1, b1, c1), …, (aR, bR, cR).]
References
• Inderjit S. Dhillon, Subramanyam Mallela, Dharmendra S. Modha. Information-theoretic co-clustering. KDD 2003, pp. 89-98.
• T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. ICDM 2005, pp. 242-249, November 2005.
• Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams. Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006.