Beyond Streams and Graphs: Dynamic Tensor Analysis
Jimeng Sun, Christos Faloutsos, Dacheng Tao
Speaker: Jimeng Sun
Motivation
Goal: incremental pattern discovery on streaming applications
- Streams: E1: environmental sensor networks; E2: cluster/data center monitoring
- Graphs: E3: social network analysis
- Tensors: E4: network forensics; E5: financial auditing; E6: fMRI brain image analysis
How can we summarize streaming data effectively and incrementally?
E3: Social network analysis
Traditionally, people focus on static networks and find community structures.
We plan to monitor the change of the community structure over time and identify abnormal individuals.
[Figure: (author × keyword) matrices evolving from 1990 to 2004, with DB and DM communities]
E4: Network forensics
- Directional network flows
- A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
- 450 GB/hour with compression
- Task: identify abnormal traffic patterns and find the cause
[Figure: (source × destination) traffic matrices for normal vs. abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Static Data model
For a timestamp, the stream measurements can be modeled using a tensor.
- Dimension: a single stream, e.g., <Christos, "graph">
- Mode: a group of dimensions of the same kind, e.g., Source, Destination, Port
[Figure: a (source × destination) matrix at Time = 0]
Static Data model (cont.)
Tensor
- Formally, a tensor of order M: X ∈ R^{N1 × N2 × … × NM}
- Generalization of matrices
- Represented as a multi-array (data cube)

Order:          1st     | 2nd               | 3rd
Correspondence: Vector  | Matrix            | 3D array
Example:        Sensors | Authors, Keywords | Sources, Destinations, Ports
Dynamic Data model (our focus)
Streams come with structure:
- (time, source, destination, port)
- (time, author, keyword)
[Figure: a sequence of (source × destination) matrices arriving over time]
Dynamic Data model (cont.)
Tensor Streams
- A sequence of Mth order tensors X1, …, Xn, where n is increasing over time

Order:          1st              | 2nd                     | 3rd
Correspondence: Multiple streams | Time-evolving graphs    | 3D arrays
Example:        Sensors over time | (author, keyword) over time | (source, destination, port) over time
Dynamic tensor analysis
[Figure: old tensors and a new tensor in the (source, destination) modes are projected through USource and UDestination onto the old core tensors]
Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
Background – Singular value decomposition (SVD)
SVD: A ≈ U Σ V^T, where A ∈ R^{m×n}, U ∈ R^{m×k}, Σ ∈ R^{k×k}, V ∈ R^{n×k}
- Best rank-k approximation in L2
- PCA is an important application of SVD
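As a quick sanity check (a sketch, not from the slides), NumPy's SVD gives the best rank-k approximation in L2; by the Eckart–Young theorem the spectral-norm error equals the (k+1)-th singular value:

```python
import numpy as np

# Sketch: best rank-k approximation via truncated SVD.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction

# Eckart-Young: the L2 (spectral) error of the best rank-k
# approximation equals the (k+1)-th singular value.
err = np.linalg.norm(A - A_k, ord=2)
assert np.isclose(err, s[k])
```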
Latent semantic indexing (LSI)
Singular vectors are useful for clustering or correlation detection.

Document-term matrix A (rows: documents, columns: terms such as "query", "cache"):
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

A = U Σ V^T (document-concept × concept-association × concept-term):
U =
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
Σ = diag(9.64, 5.29)
V^T =
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71

The two concepts correspond to the DB and DM clusters of documents and their frequent terms.
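The slide's numbers can be reproduced directly (a small check, not part of the original deck):

```python
import numpy as np

# The document-term matrix from the slide (rows: documents, cols: terms).
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Two dominant concepts, matching the slide's values (signs may flip).
assert np.allclose(s[:2], [9.64, 5.29], atol=0.01)
assert np.allclose(np.abs(U[:4, 0]), [0.18, 0.36, 0.18, 0.90], atol=0.01)
assert np.allclose(np.abs(U[4:, 1]), [0.53, 0.80, 0.27], atol=0.01)
```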
Tensor Operation: Matricize X_(d)
Unfold a tensor into a matrix along mode d.
Example: a 2 × 2 × 2 tensor with frontal slices [1 3; 2 4] and [5 7; 6 8] unfolds along mode 1 to the 2 × 4 matrix [1 3 5 7; 2 4 6 8].
Thanks to Tammy Kolda for this slide.
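A minimal sketch of matricization in NumPy (one common convention; the ordering of the remaining modes varies by author):

```python
import numpy as np

# Mode-d matricization: move mode d to the front, then flatten the rest.
def matricize(X, d):
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

X = np.arange(24).reshape(2, 3, 4)       # a 2x3x4 tensor
assert matricize(X, 0).shape == (2, 12)
assert matricize(X, 1).shape == (3, 8)
assert matricize(X, 2).shape == (4, 6)
```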
Tensor Operation: Mode-product
Multiply a tensor with a matrix along one mode.
[Figure: a (source × destination × port) tensor is multiplied along the source mode by a (source × group) matrix, yielding a (group × destination × port) tensor]
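A sketch of the mode product with `np.tensordot` (the shapes here, 500 sources grouped into 5 groups, are illustrative, not from the deck):

```python
import numpy as np

# Mode-d product: contract mode d of tensor X with the columns of matrix U,
# so the result has U.shape[1] entries along mode d.
def mode_product(X, U, d):
    return np.moveaxis(np.tensordot(X, U, axes=(d, 0)), -1, d)

# E.g., grouping 500 sources of a (source x dest x port) tensor into 5 groups.
X = np.random.rand(500, 100, 50)
U = np.random.rand(500, 5)
Y = mode_product(X, U, 0)
assert Y.shape == (5, 100, 50)
```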
Related work
- Low-rank approximation: PCA, SVD (orthogonal-basis projection)
- Multilinear analysis: tensor decompositions such as Tucker, PARAFAC, HOSVD
- Stream mining: scan the data once to identify patterns; sampling [Vitter85], [Gibbons98]; sketches [Indyk00], [Cormode03]
- Graph mining: explorative [Faloutsos04], [Kumar99], [Leskovec05], …; algorithmic [Yan05], [Cormode05], …
Our work sits at the intersection of these areas.
Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
Tensor analysis
Given a sequence of tensors X1, …, Xn, find the projection matrices U1, …, UM such that the reconstruction error e is minimized:

e = Σ_t ‖ X_t − X_t ×1 (U1 U1^T) ×2 ⋯ ×M (UM UM^T) ‖_F²

Note that this is a generalization of PCA when n is a constant.
[Figure: a (source × destination × port) tensor decomposed into a core tensor with source, destination, and port projection matrices]
[Figure: the (author × keyword) tensors from 1990 to 2004 decomposed with projections UA and UK]
Why do we care?
- Anomaly detection: driven by the reconstruction error, at multiple resolutions
- Multiway latent semantic indexing (LSI)
[Figure: query patterns over time for authors such as Philip Yu and Michael Stonebraker]
1st order DTA – problem
Given x1, …, xn where each x_i ∈ R^N, find U ∈ R^{N×R} such that the error e is small:

e = Σ_i ‖ x_i − x_i U U^T ‖²

Note that Y = XU.
[Figure: the n × N matrix X (time × sensors) is projected by U into Y ∈ R^{n×R}; the sensors separate into indoor and outdoor clusters]
1st order DTA
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^{N×N}
Output: new projection matrix U ∈ R^{N×R}
Algorithm:
1. Update the variance matrix: C_new = x^T x + C
2. Diagonalize: C_new = U Λ U^T
3. Determine the rank R and return U
Diagonalization has to be done for every new x!
[Figure: the new row x is appended to the old X along time; its outer product x^T x updates C into C_new]
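The three steps above can be sketched in NumPy (a minimal sketch; the rank-selection step is simplified to a fixed R):

```python
import numpy as np

# Sketch of the 1st-order DTA update: fold the new vector x into the
# running variance matrix, re-diagonalize, keep the top-R eigenvectors.
def dta_update(C, x, R):
    C_new = C + np.outer(x, x)            # C_new = x^T x + C
    evals, evecs = np.linalg.eigh(C_new)  # eigh returns ascending eigenvalues
    U = evecs[:, ::-1][:, :R]             # top-R eigenvectors
    return C_new, U

N, R = 10, 3
C = np.zeros((N, N))
for _ in range(100):                      # stream of data vectors
    C, U = dta_update(C, np.random.rand(N), R)
assert U.shape == (N, R)
assert np.allclose(U.T @ U, np.eye(R))    # orthonormal projection
```

Note that the full eigendecomposition runs for every new x, which is exactly the cost STA avoids.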
1st order STA
Adjust U smoothly as new data arrive, without diagonalization [VLDB05].
For each new point x:
- Project x onto the current line
- Estimate the error
- Rotate the line in the direction of the error, in proportion to its magnitude
For each new point x and for i = 1, …, k:
1. y_i := U_i^T x (projection onto U_i)
2. d_i ← d_i + y_i² (energy ∝ i-th eigenvalue)
3. e_i := x − y_i U_i (error)
4. U_i ← U_i + (1/d_i) y_i e_i (update estimate)
5. x ← x − y_i U_i (repeat with the remainder)
[Figure: the current direction U is rotated toward the new point x in the (sensor 1, sensor 2) plane; the error vector drives the rotation]
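The five update steps translate almost line for line into NumPy (a sketch; the initialization of U and the energies d is an assumption, not from the slides):

```python
import numpy as np

# Sketch of the 1st-order STA update: no diagonalization; each
# eigenvector estimate is rotated toward the error of the new point.
def sta_update(U, d, x):
    x = x.copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                # project onto current estimate
        d[i] += y * y                  # energy ~ i-th eigenvalue
        e = x - y * U[:, i]            # error
        U[:, i] += (y / d[i]) * e      # rotate toward the error
        x -= y * U[:, i]               # deflate; repeat with remainder
    return U, d

N, k = 10, 3
U = np.eye(N)[:, :k]                   # initial orthonormal estimate
d = np.full(k, 1e-3)                   # small initial energies
for _ in range(200):
    U, d = sta_update(U, d, np.random.rand(N))
assert U.shape == (N, k)
```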
Mth order DTA
For each mode d of the new tensor:
1. Matricize the new tensor into X_(d) (matricizing, transpose)
2. Construct the variance matrix of the incremental tensor: X_(d)^T X_(d)
3. Reconstruct the old variance matrix: C_d = U_d S_d U_d^T
4. Update the variance matrix: C_d ← C_d + X_(d)^T X_(d)
5. Diagonalize the variance matrix to obtain the new U_d and S_d
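One Mth-order DTA step can be sketched as follows (a simplified sketch: fixed per-mode ranks, and the old variance matrices are kept explicitly rather than reconstructed from U_d S_d U_d^T; the matricization orientation is one common convention):

```python
import numpy as np

# Sketch of one Mth-order DTA step: for each mode d, matricize the new
# tensor, add its mode-d covariance to that mode's variance matrix,
# then diagonalize and keep the top-R_d eigenvectors.
def dta_step(Cs, X, ranks):
    Us = []
    for d in range(X.ndim):
        Xd = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)  # matricize
        Cs[d] = Cs[d] + Xd @ Xd.T                          # update variance
        evals, evecs = np.linalg.eigh(Cs[d])               # diagonalize
        Us.append(evecs[:, ::-1][:, :ranks[d]])            # top-R_d vectors
    return Cs, Us

shape, ranks = (6, 5, 4), (2, 2, 2)
Cs = [np.zeros((n, n)) for n in shape]
for _ in range(20):                                        # tensor stream
    Cs, Us = dta_step(Cs, np.random.rand(*shape), ranks)
assert [U.shape for U in Us] == [(6, 2), (5, 2), (4, 2)]
```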
Mth order DTA – complexity
Storage: O(Π N_i), i.e., the size of one input tensor at a single timestamp
Computation: Σ N_i³ (or Σ N_i²) for diagonalizing each C, plus Σ (N_i Π N_j) for the matrix multiplications X_(d)^T X_(d)
For low-order tensors (order < 3), diagonalization is the main cost.
For high-order tensors, matrix multiplication is the main cost.
Mth order STA
Run 1st order STA along each mode.
Complexity:
- Storage: O(Π N_i)
- Computation: Σ R_i N_i, which is smaller than DTA
[Figure: X_(d) is matricized; each row x updates U_1 via the projection y_1 and error e_1, yielding the updated U_1]
Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
Experiments
Objectives:
- Computational efficiency
- Accurate approximation
- Real applications: anomaly detection, clustering

Data set 1: Network data
- TCP flows collected at the CMU backbone
- Raw data: 500 GB with compression
- 3rd order tensors over hourly windows with <source, destination, port>
- Each tensor: 500 × 500 × 100; 1200 timestamps (hours)
- Sparse data (e.g., 10AM to 11AM on 01/06/2005); values follow a power-law distribution
Data set 2: Bibliographic data (DBLP)
- Papers from VLDB and KDD conferences
- 2nd order tensors over yearly windows with <author, keyword>
- Each tensor: 4584 × 3741; 11 timestamps (years)
Computational cost
Setup: 3rd order network tensor and 2nd order DBLP tensor; OTA is the offline tensor analysis; performance metric: CPU time (sec).
Observations:
- DTA and STA are orders of magnitude faster than OTA
- The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
Accuracy comparison
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%.
Observations:
- DTA performs very close to OTA on both datasets
- STA performs worse on DBLP due to the bigger changes between timestamps
(3rd order network tensor and 2nd order DBLP tensor)
Network anomaly detection
Reconstruction error gives an indication of anomalies.
The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).
[Figure: reconstruction error over time for normal vs. abnormal traffic]
Multiway LSI

Authors | Keywords | Year
michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004

Two groups are correctly identified: Databases (DB, first two rows) and Data Mining (DM, third row).
People and concepts drift over time.
Conclusion
- The tensor stream is a general data model
- DTA/STA incrementally decompose tensors into core tensors and projection matrices
- The results of DTA/STA can be used in other applications: anomaly detection, multiway LSI
The world is not flat; neither should data mining be.
Final word: think structurally!
Contact: Jimeng Sun, [email protected]