
Trajectory Data Mining and Management

Hsiao-Ping Tsai 蔡曉萍 @ CSIE, YuanZe Uni.

2009.12.04

Outline

Introduction to Data Mining

Background of Trajectory Data Mining

Part I: Group Movement Patterns Mining

Part II: Semantic Data Compression

Why Data Mining?

The explosive growth of data, toward petabyte scale:
- Commerce: Web, e-commerce, bank/credit transactions, ...
- Science: remote sensing, bioinformatics, ...
- Many others: news, digital cameras, books, magazines, ...

We are drowning in data, but starving for knowledge!

What Is Data Mining?

Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge, e.g., rules, regularities, patterns, or constraints, from huge amounts of data.

Confluence of Multiple Disciplines

Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, neural networks, graph theory, and other disciplines.

Potential Applications

Data analysis and decision support:
- Market analysis and management
- Risk analysis and management
- Fraud detection and detection of unusual patterns (outliers)

Other applications:
- Text mining and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis

Data Mining Functionalities (1/2)

Multidimensional concept description: characterization and discrimination
- Generalize, summarize, and contrast data characteristics

Frequent patterns, association, correlation vs. causality
- Discovering relations between data items, e.g., Diaper -> Beer [support 0.5%, confidence 75%]

Classification and prediction
- Construct models that describe and distinguish classes
- Predict unknown or missing numerical values
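To make the bracketed numbers concrete, here is a minimal Python sketch of support and confidence (the toy transactions are invented for illustration):

    # Support/confidence of a rule X -> Y over a toy transaction database.
    transactions = [
        {"diaper", "beer", "milk"},
        {"diaper", "beer"},
        {"diaper", "bread"},
        {"milk", "bread"},
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(lhs, rhs, db):
        """P(rhs | lhs) = support(lhs union rhs) / support(lhs)."""
        return support(lhs | rhs, db) / support(lhs, db)

    print(support({"diaper", "beer"}, transactions))       # 0.5
    print(confidence({"diaper"}, {"beer"}, transactions))  # 0.666...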

Data Mining Functionalities (2/2)

Cluster analysis
- Clustering: group data to form classes, maximizing intra-class similarity and minimizing inter-class similarity

Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- Useful in fraud detection and rare-event (exception) analysis

Trend and evolution analysis
- Trend and deviation, e.g., regression analysis
- Sequential pattern mining, periodicity analysis, similarity-based analysis

Outline

Introduction to Data Mining

Background of Trajectory Data Mining

Part I: Group Movement Patterns Mining

Part II: Semantic Data Compression

Trajectory data are everywhere!

The world becomes more and more mobile:
- Prevalence of mobile devices, e.g., smart phones, car PNDs, notebooks, PDAs, ...
- Satellite, sensor, RFID, and wireless technologies have fostered many applications, producing tremendous amounts of trajectory data
- Market prediction: 25-50% of cellphones in 2010 will have GPS

Related Research Projects (1/2)

- GeoPKDD: Geographic Privacy-aware Knowledge Discovery and Delivery (Univ. of Pisa, Univ. of Piraeus, ...)
- MotionEye: Querying and Mining Large Datasets of Moving Objects (UIUC)
- GeoLife: Building social networks using human location history (Microsoft Research)
- Reality Mining (MIT Media Lab)
- Data Mining in Spatio-Temporal Data Sets (Australia's ICT Research Centre of Excellence)
- Trajectory Enabled Service Support Platform for Mobile Users' Behavior Pattern Mining (IBM China Research Lab)
- U.S. Army Research Laboratory

Related Research Projects (2/2)

- Mobile Data Management (Prof. 李強 @ CSIE.NCKU)
- Energy-efficient strategies for object tracking in sensor networks: a data mining approach (Prof. 曾新穆 @ CSIE.NCKU)
- Object tracking and moving pattern mining (Prof. 彭文志 @ CSIE.NCTU)
- Mining Group Patterns of Mobile Users (Prof. 黃三義 @ CSIE.NSYSU)

Wireless Sensor Networks (1/2)

Technical advances in wireless sensor networks (WSNs) are promising for various applications:
- Object tracking
- Military surveillance
- Dwelling security
- ...

These applications generate large amounts of location-related data, and many efforts are devoted to compiling the data to extract useful information:
- Past behavior analysis
- Future behavior prediction and estimation

Wireless Sensor Networks (2/2)

A wireless sensor network (WSN) is composed of a large number of sensor nodes:
- Each node consists of sensing, processing, and communicating components
- WSNs are data-driven
- Energy conservation is paramount among all design issues

Object tracking is viewed as a killer application of WSNs:
- The task is to detect a moving object's location and report the location data to the sink periodically
- Tracking moving objects is considered the most challenging case

Part I: Group Movement Patterns Mining

Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, "Exploring Group Moving Pattern for Tracking Objects Efficiently," accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009

[Figure: object tracking in a hierarchical WSN. (a) Monitoring each object individually: the locations of o0, o1, and o2 are reported separately through sensors si, sj, and sk to the sink. (b) Monitoring multiple objects with group data aggregation: the objects are reported together as one group, g0.]

Motivation

Many applications are more concerned with group relationships and their aggregated movement patterns:
- Movements of creatures have some degree of regularity
- Many creatures are socially aggregated and migrate together

The application-level semantics can be utilized to track objects efficiently:
- Data aggregation
- In-network scheduling
- Data compression

Assumptions

- Objects each have a globally unique ID
- The WSN has a hierarchical structure, where each sensor within a cluster has a locally unique ID, e.g., drawn from the alphabet {a, b, ..., p}
- The location of an object is modeled by the ID of a nearby sensor (or cluster)
- The trajectory of a moving object is thus modeled as a series of observations and expressed as a location sequence

Problem Formulation

Similarity: Given the similarity measure function sim_p and a minimal threshold sim_min, objects o_i and o_j are similar if their similarity score is above the threshold, i.e., sim_p(o_i, o_j) >= sim_min.

Group: A set of objects g is a group if g is contained in so(o_i) for every o_i in g, where so(o_i) denotes the set of objects that are similar to o_i.

The moving object clustering (MOC) problem: Given a set of moving objects O together with their associated location sequence data set S and a minimal threshold sim_min, the MOC problem is formulated as partitioning O into non-overlapping groups, denoted by G = {g_1, g_2, ..., g_i}, such that the number of groups is minimized, i.e., |G| is minimal.

Challenges of the MOC Problem

How to discover the group relationships?
- A centralized approach? Compiling all data at a single node (the sink) is expensive!
- Compare similarity on entire movement trajectories? Local characteristics might be blurred!

Other issues:
- Heterogeneous data from different tracking configurations
- Trade-off between resolution and privacy preservation

A distributed mining approach is more desirable.

The Proposed DGMPMine Algorithm

To resolve the MOC problem, we propose a distributed group movement pattern mining algorithm that aims to:
- Provide transmission efficiency
- Improve discriminability
- Improve clustering quality
- Provide flexibility
- Preserve privacy

Definition of a Significant Movement Pattern

- A subsequence that occurs more frequently carries more information about the movement of an object
- The movement transition distribution characterizes the movements of an object

Definition of a movement pattern (under a variable-order Markov model, VMM):
- A subsequence s of a sequence S is significant if its occurrence probability is higher than a minimal threshold, i.e., P(s) >= P_min
- A significant movement pattern is a significant subsequence s together with its transition distribution P(δ|s), with the constraint that P(δ|s) must differ from P(δ|suf(s)) by a ratio of at least r or at most 1/r

Learning of Significant Movement Patterns

Learning the movement patterns in the trajectory data set with a Probabilistic Suffix Tree (PST):
- A PST is an implementation of a VMM with the least storage requirement
- The PST building algorithm learns from a location sequence data set and generates a compact tree with O(n) time and space complexity
- The tree stores the significant movement patterns together with their empirical probabilities and conditional (next-symbol) empirical probabilities

Advantages:
- Useful and efficient for prediction, with O(L) prediction complexity
- Controllable tree depth (size)

Predicting the occurrence probability of a sequence chains the conditional probabilities, truncating each context to the longest suffix kept in the tree:

  P^T(nokjfb) = P^T(n) P^T(o|n) P^T(k|no) P^T(j|nok) P^T(f|nokj) P^T(b|nokjf)
              = P^T(n) P^T(o|n) P^T(k|o) P^T(j|k) P^T(f|j) P^T(b|okjf)
              = 0.05 x 1 x 1 x 1 x 1 x 0.33 ~ 0.0165

Example of a location sequence and the generated PST:

[Figure: the PST built from the location sequences okjfba, okjfea, and nokjfea over the 4x4 sensor alphabet {a, b, ..., p}, with parameters P_min = 0.01, L_max = 4, r = 1.25. The root's empirical symbol distribution is a:0.16, b:0.05, e:0.11, f:0.16, j:0.16, k:0.16, n:0.05, o:0.16; e.g., node kjf predicts b:0.33, e:0.67.]
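To illustrate the idea, a minimal Python sketch (a simplification, not the paper's algorithm: it keeps every subsequence whose empirical probability clears P_min and omits the r-ratio pruning test against suf(s)):

    from collections import Counter, defaultdict

    def learn_pst(sequences, p_min=0.01, l_max=4):
        """Sketch of PST learning: count subsequences up to length l_max,
        keep those whose empirical probability reaches p_min, and store the
        empirical next-symbol distribution of each kept context."""
        counts = Counter()               # occurrences of each subsequence
        windows = Counter()              # number of length-k windows, per k
        next_sym = defaultdict(Counter)  # context -> counts of the next symbol
        for seq in sequences:
            for i in range(len(seq)):
                for k in range(1, l_max + 1):
                    if i + k > len(seq):
                        break
                    s = seq[i:i + k]
                    counts[s] += 1
                    windows[k] += 1
                    if i + k < len(seq):
                        next_sym[s][seq[i + k]] += 1
        tree = {}
        for s, c in counts.items():
            if c / windows[len(s)] >= p_min:
                dist = next_sym[s]
                total = sum(dist.values())
                tree[s] = {a: n / total for a, n in dist.items()} if total else {}
        return tree

    def predict(tree, context, symbol):
        """P(symbol | context), using the longest suffix of `context` in the tree."""
        for start in range(len(context)):
            suffix = context[start:]
            if tree.get(suffix):
                return tree[suffix].get(symbol, 0.0)
        return 0.0

    pst = learn_pst(["okjfba", "okjfea", "nokjfea"])
    print(predict(pst, "okjf", "e"))   # ~0.67, matching the slide's PST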

Similarity Comparison

A novel pattern-based similarity measure is proposed to compare objects:
- It measures the similarity of two objects based on their movement patterns
- It provides better scalability and resilience to outliers
- It is free from sequence alignment and variable-length handling
- It considers not only the patterns shared by two objects but also their relative importance to the individual objects, providing better discriminability

The Novel Similarity Measure sim_p

sim_p computes the similarity of objects o_i and o_j based on their PSTs. Conceptually, it aggregates the per-pattern Euclidean distances between the occurrence probabilities predicted by the two trees over the union of their significant patterns, normalizes the sum, and maps smaller distances to higher similarity scores, where

- S: the union of the significant patterns of T_i and T_j
- L_max: the maximal order of the VMM (i.e., the maximal depth of a PST)
- Σ: the alphabet of symbols (the IDs of a cluster of sensors)
- P^{T_i}(s): the predicted occurrence probability of s based on T_i
- d_s: the Euclidean distance of a significant pattern s with regard to T_i and T_j
- N: a normalization factor
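A minimal sketch of this structure, assuming the significant patterns and their predicted occurrence probabilities have already been extracted from each PST (e.g., with the learn_pst sketch above). The paper's exact normalization differs, and its scores can exceed 1, as in the example later; this keeps only the structure, with smaller pattern-probability distances yielding higher scores:

    def sim_p(probs_i, probs_j):
        """Pattern-based similarity sketch: probs_i maps each significant
        pattern s to P^{T_i}(s). Patterns missing from one tree contribute
        their full probability mass as distance; for scalar probabilities
        the Euclidean distance reduces to an absolute difference."""
        union = set(probs_i) | set(probs_j)
        if not union:
            return 0.0
        dist = sum(abs(probs_i.get(s, 0.0) - probs_j.get(s, 0.0)) for s in union)
        return 1.0 - dist / len(union)   # normalization: |S| (a simplification)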

Local Grouping Phase: The GMPMine Algorithm

Step 1. Learn the movement patterns (a PST) for each object
Step 2. Compute the pair-wise similarity scores to construct a similarity graph
Step 3. Partition the similarity graph into highly connected subgraphs (sketched below)
Step 4. Choose representative movement patterns for each group
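A simplified sketch of steps 2 and 3, assuming a pairwise sim(a, b) function such as sim_p above (plain connected components stand in for the highly connected subgraph extraction, which additionally recurses on minimum cuts):

    def build_similarity_graph(objects, sim, sim_min):
        """Step 2: connect two objects when their similarity score clears sim_min."""
        edges = {o: set() for o in objects}
        for a in objects:
            for b in objects:
                if a < b and sim(a, b) >= sim_min:
                    edges[a].add(b)
                    edges[b].add(a)
        return edges

    def connected_components(edges):
        """Simplified stand-in for step 3: the paper extracts *highly connected*
        subgraphs via recursive minimum cuts; plain connected components
        over-merge but illustrate the flow."""
        seen, groups = set(), []
        for start in edges:
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend(edges[v] - comp)
            seen |= comp
            groups.append(comp)
        return groups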

[Figure: (a) the moving trajectories of objects o4-o11 on a 5x5 sensor grid; (b) the location sequence data set collected at cluster head CH_a, e.g., o4: ddbadddbadddbaddddadddda..., o10: dcddddcdddcdddccddccddcc...]

[Figure: the PSTs learned from the individual location sequences: (a) PST of o4, (b) PST of o5, (c) PST of o8, (d) PST of o9. The trees of o4 and o5 are dominated largely by transitions among a, b, and d, whereas those of o8 and o9 are dominated by transitions between c and d.]

Example of GMPMine

[Figure: (a) the similarity graph over objects o4-o10; (b) its highly connected subgraphs.]

For example, sim_p(o4, o5) = 1.618, sim_p(o4, o9) = 1.067, and sim_p(o8, o9) = 1.832.

Inconsistency may exist among the local grouping results:
- The trajectory of a group may span across several clusters
- Group relationships may vary at different locations
- A CH may have incomplete statistics
- ...

A consensus function is required to combine the multiple local grouping results, in order to remove inconsistency, improve clustering quality, and improve stability.

Global Ensembling Phase

The labels of the local grouping results (-1 denotes ungrouped):

        Ga   Gb   Gc   Gd
  o0    -1    0    2   -1
  o1    -1    1    2   -1
  o2    -1    1    2   -1
  o3    -1    0    2   -1
  o4     1    2    0    0
  o5     2    2    2    0
  o6     2   -1    0    0
  o7     2   -1   -1    0
  o8     0   -1   -1    1
  o9     0    3    1    1
  o10    0    3   -1    1
  o11   -1   -1   -1    1

Global Ensembling Phase (contd.)

Normalized mutual information (NMI) is useful for measuring the shared information between two grouping results.

Given K local grouping results, the objective is to find a solution that keeps the most information of the local grouping results.

With P_a(i) = |g_i^a| / |O| and the joint probability P_{a,b}(i,j) = |g_i^a ∩ g_j^b| / |O|, the entropy of a grouping G_a with m_a groups is

  H(G_a) = - Σ_{i=1}^{m_a} P_a(i) log P_a(i),

and

  NMI(G_a, G_b) = ( Σ_{i,j} P_{a,b}(i,j) log( P_{a,b}(i,j) / (P_a(i) P_b(j)) ) ) / sqrt( H(G_a) H(G_b) ).
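These formulas transcribe directly into code (a sketch: each label vector assigns a group index per object; the natural logarithm is used, and the log base cancels in the ratio):

    import math
    from collections import Counter

    def nmi(labels_a, labels_b):
        """NMI between two grouping results, written from the definitions
        above: P_a(i), the joint P_ab(i,j), the entropies H(G_a), H(G_b),
        and NMI = I(G_a; G_b) / sqrt(H(G_a) H(G_b))."""
        n = len(labels_a)
        pa, pb = Counter(labels_a), Counter(labels_b)
        pab = Counter(zip(labels_a, labels_b))
        entropy = lambda c: -sum((v / n) * math.log(v / n) for v in c.values())
        mi = sum((v / n) * math.log(v * n / (pa[i] * pb[j]))
                 for (i, j), v in pab.items())
        denom = math.sqrt(entropy(pa) * entropy(pb))
        return mi / denom if denom else 0.0

    print(nmi([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # partial agreement < 1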

The CE Algorithm

For a set of similarity thresholds D, we reformulate our objective as

  G* = argmax_{G_δ, δ ∈ D} Σ_{i=1}^{K} NMI(G_i, G_δ).

The CE algorithm includes three steps:
1. Measure the pair-wise similarity of objects to construct a similarity matrix by using the Jaccard coefficient
2. Generate a partitioning result for each threshold in D based on the similarity matrix
3. Select the final ensembling result (sketched below)
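A compact sketch of the three steps, reusing nmi() and connected_components() from the earlier sketches (the treatment of the ungrouped label -1 inside the Jaccard coefficient is an assumption):

    def jaccard(li, lj):
        """Jaccard coefficient of two objects' label vectors across the K
        local results: agreements count only when both objects are grouped
        together (how -1 is handled here is an assumption)."""
        both = sum(1 for a, b in zip(li, lj) if a == b and a != -1)
        any_ = sum(1 for a, b in zip(li, lj) if a != -1 or b != -1)
        return both / any_ if any_ else 0.0

    def ce(label_vectors, thresholds):
        """Steps 1-3: build the Jaccard matrix, partition for each δ in D,
        and keep the partition G_δ maximizing Σ_i NMI(G_i, G_δ)."""
        objs = sorted(label_vectors)
        k = len(label_vectors[objs[0]])
        best, best_score = None, float("-inf")
        for delta in thresholds:
            edges = {a: {b for b in objs if b != a and
                         jaccard(label_vectors[a], label_vectors[b]) >= delta}
                     for a in objs}
            groups = connected_components(edges)   # from the GMPMine sketch
            ens = {o: gi for gi, grp in enumerate(groups) for o in grp}
            score = sum(nmi([label_vectors[o][i] for o in objs],
                            [ens[o] for o in objs]) for i in range(k))
            if score > best_score:
                best, best_score = groups, score
        return best, best_score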

Example of CE

[Figure: (a) the labels of the local grouping results Ga-Gd (the table above); (b) the Jaccard similarity matrix over o0-o11; (c) the similarity graph and its highly connected subgraphs for δ = 0.1, which yield the groups {0,1,2,3}, {5,6,7}, and {8,9,10,11}.]

For D = {i/10 | 1 <= i <= 5}, the candidate results and their total NMI scores are:

  δ     G_δ                                  Σ_i NMI(G_δ, G_i)
  0.1   {{0,1,2,3},{5,6,7},{8,9,10,11}}      2.322
  0.2   {{0,1,2,3},{5,6,7},{8,9,10,11}}      2.322
  0.3   {{0,1,2,3},{4,5,6,7},{8,9,10,11}}    2.636
  0.4   {{0,1,2,3},{4,5,6,7},{8,9,10}}       2.401
  0.5   {{0,1,2,3},{4,5,6,7},{8,9,10}}       2.401

The result at δ = 0.3 maximizes the total NMI and is selected as the final ensemble.

Part II: Semantic Data Compression

Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, “Exploring Application Level Semantics for Data Compression,” accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009

Introduction

[Figure: in a batch-and-send network, the sender buffers a batch of data over time before transmitting it to the receiver.]

Data transmission is one of the most energy-expensive operations in WSNs. In a batch-and-send network, nodes buffer data (e.g., in NAND flash memory) and send it in batches to:
- Reduce network energy consumption
- Increase network throughput

Data compression is a natural paradigm in WSNs. However, few works address the application-dependent semantics in the data, such as the correlations within a group of moving objects.

How should the location data of a group of objects be managed?
- Compress the data with general-purpose algorithms like Huffman coding?
- Compress a group of trajectory sequences simultaneously?

Motivation

Redundancy in a group of location sequences comes from two aspects:
- Vertical redundancy: across sequences, from the group relationships
- Horizontal redundancy: within a sequence, from the statistics and the predictability of symbols

What Is Predictability of Symbols?

With the group movement patterns shared between the sender and the receiver, both sides can predict the next location (symbol). Replacing predictable items with a common symbol helps reduce the entropy!

Problem Formulation

Assumptions:
- A batch-based tracking network
- Group movement patterns are shared between a sender and a receiver

The Group Data Compression (GDC) problem: Given the group movement patterns of a group of objects, the GDC problem is formulated as a merge problem and a hit item replacement (HIR) problem to reduce the number of bits required to represent their location sequences:
- The merge problem is to combine multiple location sequences to reduce the overall sequence length
- The HIR problem is to minimize the entropy of a sequence such that the amount of data is reduced, with or without loss of information

Our Approach

The proposed two-phase and two-dimensional (2P2D) algorithm:
- Sequence merge phase: utilizes the group relationships to merge the location data of a group of objects (vertically)
- Entropy reduction phase: utilizes the objects' movement patterns to reduce the entropy of the merged data (horizontally)

[Figure: the sender compresses the merged bitstream and the receiver uncompresses it, with the movement patterns shared on both sides.]

The compressibility is enhanced with or without information loss, and the reduction of entropy is guaranteed.

We propose the Merge algorithm, which avoids redundant reporting of the objects' locations:
- It trims multiple identical symbols into a single symbol
- When a tolerance of loss of accuracy is specified, it chooses a qualified symbol to represent multiple symbols, such that the maximal distance between the reported location and the real location is below a specified error bound eb
- When multiple qualified symbols exist, it chooses the symbol that minimizes the average location error

Sequence Merge Phase

[Figure: merging the location sequences S0, S1, and S2 of three objects into a single sequence S''. Identical symbols across the group are reported once; where an object diverges, a '/'-separated correction is appended, and the per-object tag lists (taglst0, taglst1, taglst2) mark the divergent positions. In the example, 60 symbols are reduced to 20 symbols.]
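A minimal lossless sketch of this idea, reporting the majority symbol per time slot with per-object tag bits and '/'-separated corrections (the lossy variant that picks a qualified symbol within the error bound eb is not covered):

    from collections import Counter

    def merge(sequences):
        """Sketch of the Merge idea (lossless case): per time slot, report
        the majority symbol once for the whole group; each object gets a tag
        bit (0 = matches the group symbol, 1 = differs), and the divergent
        symbols are appended after a '/' separator."""
        merged, taglists = [], [[] for _ in sequences]
        for slot in zip(*sequences):
            rep = Counter(slot).most_common(1)[0][0]   # representative symbol
            merged.append(rep)
            diffs = []
            for i, sym in enumerate(slot):
                hit = sym == rep
                taglists[i].append(0 if hit else 1)
                if not hit:
                    diffs.append(sym)
            if diffs:
                merged.append("/" + "".join(diffs))
        return "".join(merged), taglists

    s, tags = merge(["okbap", "okbap", "okgap"])
    print(s)     # 'okb/gap': one copy of the group trajectory plus o2's 'g'
    print(tags)  # [[0,0,0,0,0], [0,0,0,0,0], [0,0,1,0,0]]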

Entropy Reduction Phase

- The group movement patterns carry the information about whether an item of a sequence is predictable
- Since some items are predictable, extra redundancy exists
- How can we remove the redundancy and even increase the compressibility?

Entropy Reduction Phase (contd.)

According to Shannon's source coding theorem, the entropy lower-bounds the average number of bits per symbol, and thus bounds the achievable compression ratio.

Definition of entropy:

  e(S) = e(p_0, p_1, ..., p_{|Σ|-1}) = - Σ_{i=0}^{|Σ|-1} p_i log2 p_i

Increasing the skewness of the data reduces the entropy, e.g.,

  e(1/16, 1/16, ..., 1/16) = 4
  e(1/16, 3/32, 1/32, ..., 1/16) ~ 3.97
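The effect is easy to verify in code (a toy alphabet; folding one symbol's probability mass onto another skews the distribution and lowers the entropy):

    import math
    from collections import Counter

    def entropy(seq):
        """Shannon entropy e(S) = -Σ p_i log2 p_i over the symbol frequencies."""
        n = len(seq)
        return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

    uniform = "abcdefghijklmnop"        # 16 distinct symbols, each p_i = 1/16
    print(entropy(uniform))             # 4.0 bits per symbol
    skewed = uniform.replace("b", "a")  # move b's probability mass onto a
    print(entropy(skewed))              # 3.875 < 4.0: more skew, less entropy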

[Figure: two replacement choices on the same merged sequence with initial entropy e = 2.883. (a) A good choice of replacements lowers the entropy to 2.752, close to the optimum of 2.718. (b) A poor choice raises it to 2.963.]

The Hit Item Replacement (HIR) Problem

A simple and intuitive method is to replace all predictable symbols to increase the skewness. However, this simple method cannot guarantee a reduction of the entropy.

[Figure: a merged sequence with its tag list, with entropy e = 3.053 before replacement and e = 2.854 after.]

Definition of the HIR problem: Given a sequence and the information about whether each item is predictable, the HIR problem is to decide whether to replace each of the predictable symbols in the sequence with a hit symbol, so as to minimize the entropy of the sequence.

Three Rules

Let n_α be the number of items of symbol α in S, n_α^hit the number of predictable items of α in S, and ŝ the subset of Σ that contains all predictable symbols in S.

1. Accumulation rule: for a symbol α in ŝ, if n_α = n_α^hit, replace all items of α.
2. Concentration rule: for a symbol α in ŝ, if n_α <= n_'.' or n_α - n_α^hit >= n_'.', replace all predictable items of α.
3. Multi-symbol rule: for a set of predictable symbols ŝ' in ŝ, if gain(ŝ') >= 0, replace all predictable items of the symbols in ŝ'.
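A greedy sketch in the spirit of these rules: it applies the accumulation rule directly and substitutes an explicit entropy check for the concentration and multi-symbol rules (entropy() is the function from the earlier sketch):

    from collections import Counter

    def replace_hits(seq, hits, hit_symbol="."):
        """Greedy sketch of the Replace idea: first the accumulation rule
        (a symbol whose every occurrence is predictable is replaced
        outright), then keep replacing the predictable items of any
        remaining symbol as long as that lowers the entropy."""
        seq = list(seq)
        counts = Counter(seq)
        hit_counts = Counter(s for s, h in zip(seq, hits) if h)
        for sym, c in hit_counts.items():
            if c == counts[sym]:                       # accumulation rule
                seq = [hit_symbol if s == sym else s for s in seq]
        improved = True
        while improved:
            improved = False
            candidates = {s for s, h in zip(seq, hits) if h and s != hit_symbol}
            for sym in candidates:
                trial = [hit_symbol if (s == sym and h) else s
                         for s, h in zip(seq, hits)]
                if entropy(trial) < entropy(seq):      # replace only if it helps
                    seq, improved = trial, True
        return "".join(seq)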

Example of the Replace Algorithm

[Figure: the Replace algorithm runs on a merged sequence with ŝ = {a, f, j, k, o} and initial entropy e = 3.053. It evaluates the symbols of ŝ in turn, replacing their predictable items and shrinking ŝ through {f, j, k, o}, {f, k}, and {k} to {}, while the entropy drops through 3.053, 2.969, and 2.893 to the final 2.854.]

Segmentation, Alignment, and Packaging

[Figure: the sequences S0-S2 are segmented along the time axis (t0-t4) into S-segments (reported per object) and G-segments (merged and reported per group). Each segment is Huffman-coded, and a packet carries the object or group ID, a timestamp, the length, and the bitstream, together with the Huffman code table, e.g., (o0, t0, l_A, bitstream_A) and (g0, t2, l_E, bitstream_E).]
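The deck does not spell out the coder, but the packaging step can be sketched with a standard Huffman construction (the segment string in the usage lines is invented):

    import heapq
    from collections import Counter

    def huffman_code(seq):
        """Build a Huffman code table for the symbols of a segment (a sketch
        of the packaging step: each segment is entropy-coded and shipped
        together with its code table)."""
        counts = Counter(seq)
        if len(counts) == 1:                      # degenerate one-symbol case
            return {next(iter(counts)): "0"}
        heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
        heapq.heapify(heap)
        tick = len(heap)                          # tie-breaker for the heap
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tick, merged))
            tick += 1
        return heap[0][2]

    table = huffman_code("okbgapokka")
    bitstream = "".join(table[s] for s in "okbgapokka")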

~The End~

Any Questions?
