25.11.2009
Data Mining MTAT.03.183
(4AP = 6EAP)
Streams, time series
Jaak Vilo
2009 Fall
Summary so far
• Data preparation
• Machine learning
• Statistics/significance
• Large data – algorithmics
• Visualisation
• Queries/reporting, OLAP
• Different types of data
• Business value
Jaak Vilo and other authors, UT: Data Mining 2009
Streams, time series
• Time
• Sequence order and position
• Continuously arriving data
Wikipedia
• Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.
• In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.
Software
• RapidMiner: free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin, formerly the concept drift plugin)
• MOA (Massive Online Analysis): free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift method, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius-based functions. MOA supports bi-directional interaction with Weka (machine learning).
On-line analysis of streams
• Clustering data streams
• Classification of data streams
• Mining frequent patterns in data streams
• Mining sequential patterns in data streams
• Mining partial periodicity in data streams
• Mining outliers and unusual patterns in data streams
• …
Clustering on Streams
• K-means: not suitable for stream mining
• CluStream: assumes the shape of each cluster is always a circle
• DenStream: detects arbitrary-shape clusters in stream data
Frequent Pattern Mining (FPM) in data streams
• Frequent (hot/top) patterns: items, itemsets, or sequences occurring frequently in a database
ISSUES: Frequent Pattern Mining (FPM) in data streams
- Limited memory
- Reading past data is impossible
Question: how justified is it to mine only frequent patterns in a data stream?
Infrequent pattern mining
Objectives:
1. To find abnormal, surprising, or "interesting" patterns in the data stream
2. Mutual pattern mining
3. Stream-specific itemset mining
4. Association rule mining among events of interest
Applications:
1. Text mining
2. Distributed sensor networks
3. Works well for evolving data streams
Challenges in Stream Data Analysis
• Data volume is huge
• Need to remember recent and historical data
• Approaches to data reduction
• Need single linear-scan algorithms
• Most existing algorithms and prototype systems are memory- and CPU-bound, and can only perform a single data mining function
• Desire to perform multiple analyses at the same time
• Occurrence of concept drifts, where the previous model is no longer valid
• Reduce the cost of learning where models need to be updated and replaced
• Require instant response
Loretta Auvil
Stream Data Reduction
• Challenges of "OLAP-ing" stream data
  • Raw data cannot be stored
  • Simple aggregates are not powerful enough
  • History shape and patterns at different levels are desirable
• MAIDS unique approach
  • A tilted time window to aggregate data at different points in time
  • A scalable multi-dimensional stream data cube that can aggregate a model of stream data efficiently without accessing the raw data
MAIDS Approach: Tilted Time Window
• Recent data is registered and weighted at a finer granularity than longer term data
• As the edge of a time window is reached, the finer-granularity data is summarized and propagated to a coarser granularity
• Window is maintained automatically
[Figure: tilted time windows, from present to past: 30 sec, 15 minutes, 4 qtrs, 24 hrs, 7 days]
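A minimal sketch of the tilted-time-window idea, assuming a toy sum aggregate and illustrative level capacities (the real MAIDS granularities and summarization differ):

```python
from collections import deque

class TiltedTimeWindow:
    """Toy tilted time window: each level holds `capacities[i]` buckets of one
    granularity; when a level overflows, its buckets are summarized (here:
    summed) into a single coarser bucket one level up."""

    def __init__(self, capacities=(4, 4, 4)):
        self.capacities = capacities
        self.levels = [deque() for _ in capacities]

    def add(self, value):
        self._insert(0, value)

    def _insert(self, level, value):
        self.levels[level].appendleft(value)          # newest bucket at the front
        if len(self.levels[level]) > self.capacities[level]:
            merged = sum(self.levels[level])          # summarize finer buckets...
            self.levels[level].clear()
            if level + 1 < len(self.levels):
                self._insert(level + 1, merged)       # ...and propagate coarser

    def total(self):
        return sum(sum(lv) for lv in self.levels)
```

The window maintains itself: recent data stays at fine granularity, older data survives only in aggregated form, yet the overall total is preserved.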
MAIDS: Stream Mining Architecture
MAIDS is aimed to:
• Discover changes, trends, and evolution characteristics in data streams
• Construct clusters and classification models from data streams
• Explore frequent patterns and similarities among data streams
Features of MAIDS
• General-purpose tool for data stream analysis
• Processes high-rate and multi-dimensional data
• Adopts a flexible tilted time window framework
• Facilitates multi-dimensional analysis using a stream cube architecture
• Integrates multiple data mining functions
• Provides a user-friendly interface: automatic analysis and on-demand analysis
• Facilitates setting alarms for monitoring
• Built in D2K as D2K modules and leveraged in the D2K Streamline tool
Statistics Query Engine
• Answers user queries on data statistics, such as count, max, min, average, regression, etc.
• Uses tilted time window
• Uses an efficient data structure, the H-tree, for partial computation of data cubes
Stream Data Classifier
• Builds models to make predictions
• Uses Naïve Bayesian Classifier with boosting
• Uses Tilted Time Window to track time related info
• Sets alarm to monitor events
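The slide above says the MAIDS classifier uses naïve Bayes with boosting. As a hedged illustration of the incremental part only (no boosting, no tilted time window; the class and feature names are hypothetical), a count-based streaming naïve Bayes could look like:

```python
from collections import defaultdict
import math

class StreamingNaiveBayes:
    """Minimal incremental naive Bayes for categorical features: counts are
    updated one instance at a time, so the model keeps learning as the
    stream arrives."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(int)   # (class, feature_index, value) -> count
        self.n = 0

    def update(self, features, label):
        self.n += 1
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feat_counts[(label, i, v)] += 1

    def predict(self, features):
        best, best_lp = None, -math.inf
        for c, cc in self.class_counts.items():
            lp = math.log(cc / self.n)
            for i, v in enumerate(features):
                # Laplace smoothing so unseen (class, value) pairs keep nonzero mass
                lp += math.log((self.feat_counts[(c, i, v)] + 1) / (cc + 2))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Because only counts are stored, one pass over the stream suffices and memory grows with the number of distinct (class, feature, value) triples, not with the stream length.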
Stream Pattern Finder
• Finds frequent patterns with multiple time granularities
• Keeps precise/compressed history in the tilted time window
• Mines only the itemsets of interest using the FP-tree algorithm
• Mines evolution and dramatic changes of frequent patterns
Stream Data Clustering
• Two stages: micro-clustering and macro-clustering
• Uses micro-clustering for incremental, online processing and maintenance
• Uses a tilted time frame
• Detects outliers when new clusters are formed
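As a sketch of the online micro-clustering stage (loosely in the spirit of CluStream's first phase; the fixed `radius` threshold, 1-D points, and the statistics kept are simplifying assumptions, not the MAIDS design):

```python
import math

class MicroClusterer:
    """Online micro-clustering sketch: each micro-cluster keeps a count,
    linear sum, and squared sum, so its centroid (and variance) can be
    updated incrementally. A point is absorbed by the nearest micro-cluster
    if it lies within `radius` of its centroid; otherwise it seeds a new
    micro-cluster (this is where outliers first show up)."""

    def __init__(self, radius=1.0):
        self.radius = radius
        self.clusters = []            # each cluster: [n, sum_x, sum_x_squared]

    def insert(self, x):
        best, best_d = None, math.inf
        for c in self.clusters:
            d = abs(x - c[1] / c[0])  # distance to the cluster centroid
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= self.radius:
            best[0] += 1; best[1] += x; best[2] += x * x
        else:
            self.clusters.append([1, x, x * x])
```

A macro-clustering step would then run an offline algorithm (e.g., k-means) over the micro-cluster centroids rather than over the raw stream.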
Demonstration
Significant Advances In the Areas of Data Management and Mining
• Tilted-time window for multi-resolution modeling
• Multi-dimensional analysis using a stream cube architecture
• Efficient "one-look" stream data mining algorithms: classification, frequent pattern analysis, clustering, and information visualization
• Integration of "one-look" approaches into one stream data mining platform so they can cooperate to discover patterns and surprising events in real time
• Internationally recognized research leadership in the areas of data management, mining, and knowledge sharing
• Experience in development of a robust software framework supporting advanced data mining and information visualization
• Experience in development of software environments supporting problem solving and evidence-based decision making
Loretta Auvil
Knowledge Extraction from Streaming Text
Information extraction:
• the process of using advanced automated machine learning approaches
• to identify entities in text documents
• and to extract this information along with the relationships these entities may have in the text documents
This project demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.
Queries are often continuous
• Evaluated continuously as stream data arrives
• Answer updated over time
November 25, 2009, Data Mining: Concepts and Techniques
Queries are often complex
• Beyond element-at-a-time processing
• Beyond stream-at-a-time processing
• Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional processing and data mining
• Most stream data are low-level or multi-dimensional in nature
Processing Stream Queries
Query types
• One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
• Predefined query vs. ad-hoc query (issued online)
Unbounded memory requirements
• For real-time response, a main-memory algorithm should be used
• Memory requirement is unbounded if one will join future tuples
Approximate query answering
• With bounded memory, it is not always possible to produce exact answers
• High-quality approximate answers are desired
• Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
Methodologies for Stream Data Processing
Major challenges
• Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
• Synopses (trade-off between accuracy and storage)
• Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)
• Compute an approximate answer within a small error range (factor ε of the actual answer)
Major methods
• Random sampling
• Histograms
• Sliding windows
• Multi-resolution model
• Sketches
• Randomized algorithms
Stream Data Processing Methods (1)
Random sampling (but without knowing the total length in advance)
Reservoir sampling: maintain a set of s candidates in the reservoir, which forms a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.
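The reservoir scheme above can be sketched as follows (this is the standard single-pass algorithm; the function name is ours):

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Single-pass reservoir sampling: keep the first s elements, then
    replace a uniformly chosen slot with the i-th element with
    probability s/i, so every element seen so far is in the reservoir
    with equal probability."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(x)
        else:
            j = rng.randrange(i)      # uniform in [0, i)
            if j < s:                 # happens with probability s/i
                reservoir[j] = x
    return reservoir
```

Note that the stream length N never needs to be known in advance, which is exactly the point of the method.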
Sliding windows
• Make decisions based only on recent data of sliding window size w
• An element arriving at time t expires at time t + w
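A small sketch of time-based expiry, assuming (t, value) pairs arriving in time order and count/sum as the tracked statistics:

```python
from collections import deque

def sliding_window_stats(events, w):
    """Maintain a time-based sliding window: an element arriving at time t
    expires at time t + w. Yields (time, count, total) after each arrival.
    `events` is an iterable of (t, value) pairs in time order."""
    window = deque()                              # (t, value), oldest on the left
    count, total = 0, 0
    for t, v in events:
        window.append((t, v)); count += 1; total += v
        while window and window[0][0] + w <= t:   # expired: t0 + w <= t
            t0, v0 = window.popleft()
            count -= 1; total -= v0
        yield t, count, total
```

Each element is appended and popped at most once, so the amortized cost per arrival is O(1).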
Histograms
• Approximate the frequency distribution of element values in a stream
• Partition data into a set of contiguous buckets
• Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
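The equal-width variant can be sketched as follows (the closure-based interface is an illustrative choice; a fixed value range is assumed):

```python
def equal_width_histogram(lo, hi, n_buckets):
    """Streaming equal-width histogram: the value range [lo, hi) is split
    into n_buckets equal buckets; only bucket counts are stored, never the
    raw stream. Returns an `add` function and the live counts list."""
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets

    def add(x):
        i = int((x - lo) / width)
        counts[min(max(i, 0), n_buckets - 1)] += 1   # clamp out-of-range values
    return add, counts
```

A V-optimal histogram would instead choose bucket boundaries to minimize the frequency variance within each bucket, which is more accurate but harder to maintain incrementally.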
Multi-resolution models
Popular models: balanced binary trees, micro-clusters, and wavelets
Stream Data Processing Methods (2)
Sketches
• Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
• Frequency moments of a stream A = {a1, …, aN}:
  F_k = Σ_{i=1}^{v} m_i^k
  where v is the universe (domain) size and m_i is the frequency of i in the sequence
• Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space
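For reference, the moments themselves can be computed exactly in O(v) space, i.e., without the sketching that makes them stream-friendly:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over distinct values i of m_i ** k, where m_i is the
    frequency of i. F0 is the number of distinct values, F1 the stream
    length, and F2 the 'surprise number'. AMS-style sketches approximate
    these in O(log v + log N) space; this exact version is for checking."""
    m = Counter(stream)
    return sum(c ** k for c in m.values())
```

For the stream "abbccc": F0 = 3 distinct values, F1 = 6 elements, F2 = 1 + 4 + 9 = 14.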
Randomized algorithms
• Monte Carlo algorithm: bound on running time, but may not return the correct result
• Chebyshev's inequality:
  Let X be a random variable with mean μ and standard deviation σ. Then
  P(|X − μ| > k) ≤ σ² / k²
• Chernoff bound:
  Let X be the sum of independent Poisson trials X1, …, Xn, and δ ∈ (0, 1]. Then
  P(X < (1 − δ)μ) < e^(−μδ²/4)
• The probability decreases exponentially as we move away from the mean
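A quick exact check of Chebyshev's bound on a fair six-sided die (μ = 3.5, σ² = 35/12, k = 2), computed directly from the distribution rather than by simulation:

```python
# Sanity check of Chebyshev's inequality P(|X - mu| > k) <= sigma^2 / k^2
# on a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
mu = sum(values) / 6                                    # 3.5
var = sum((v - mu) ** 2 for v in values) / 6            # 35/12
k = 2.0
tail = sum(1 for v in values if abs(v - mu) > k) / 6    # P(X in {1, 6}) = 1/3
assert tail <= var / k ** 2                             # 1/3 <= 35/48, bound holds
```

As typical for Chebyshev, the bound (35/48 ≈ 0.73) is valid but loose compared to the true tail probability (1/3).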
Approximate Query Answering in Streams
Sliding windows
• Only over sliding windows of recent stream data
• Approximation, but often more desirable in applications
Batched processing, sampling, and synopses
• Batched if update is fast but computing is slow: compute periodically, not very timely
• Sampling if update is slow but computing is fast: compute using sample data, but not good for joins, etc.
• Synopsis data structures: maintain a small synopsis or sketch of data; good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.: blocking if unable to produce the first output until seeing the entire input
MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams
Stream Data Mining vs. Stream Querying
Stream mining: a more challenging task in many cases
• It shares most of the difficulties with stream querying
• But often requires less "precision", e.g., no join, grouping, sorting
• Patterns are hidden and more general than querying
• It may require exploratory analysis, not necessarily continuous queries
Stream data mining tasks
• Multi-dimensional on-line analysis of streams
• Mining outliers and unusual patterns in stream data
• Clustering data streams
• Classification of stream data
Concept drift
• In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.
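As a deliberately simplified illustration of reacting to drift from a model's error stream (a two-window heuristic, not EDDM or DDM; `win` and `threshold` are arbitrary choices):

```python
from collections import deque

def drift_detector(errors, win=30, threshold=0.2):
    """Very simplified concept-drift check: compare the error rate over a
    recent window against the long-run error rate, and flag drift whenever
    the gap exceeds `threshold`. `errors` is a stream of 0/1 values
    (1 = the deployed model misclassified that instance)."""
    recent = deque(maxlen=win)
    seen, total_err = 0, 0
    for e in errors:
        recent.append(e); seen += 1; total_err += e
        if seen >= 2 * win:                 # wait until both estimates are stable
            gap = sum(recent) / len(recent) - total_err / seen
            if gap > threshold:
                yield seen                  # stream position where drift is flagged
```

When a flag is raised, a stream learner would typically retrain or replace the model using only recent data.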
Episode Rules
• Association rules applied to sequences of events.
• Episode – set of event predicates and partial ordering on them
Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept. of CS, Autumn 2001
Basics
• Association rules describe how things occur together in the data
  – E.g., "IF an alarm has certain properties, THEN it will have other given properties"
• Episode rules describe temporal relationships between things
  – E.g., "IF a certain combination of alarms occurs within a time period, THEN another combination of alarms will occur within a time period"
Course on Data Mining, page 41/54
Network Management System
[Figure: network hierarchy. MSCs (mobile station controllers) in the switched network; BSCs (base station controllers) and BTSs (base station transceivers) in the access network. Alarms originate from these network elements.]
Basics
• As defined earlier, telecom data contains alarms:
  1234 EL1 PCM 940926082623 A1 ALARMTEXT..
  (alarm number, alarming network element, alarm type, date and time, alarm severity class, alarm text)
• Now we forget about relationships between attributes within alarms, as with the association rules
• We just take the alarm number attribute, handle it here as the event/alarm type, and inspect the relationships between events/alarms
Episodes
• Partially ordered set of events
• Serial episode: totally ordered with time constraint
Basics
• Data:
  – Data is a set R of event types and a sequence of events over R
  – Every event is a pair (A, t), where
    • A ∈ R is the event type (e.g., alarm type)
    • t is an integer, the occurrence time of the event
  – An event sequence s on R is a triple (s, Ts, Te), where
    • Ts is the starting time and Te the ending time; Ts < Te are integers
    • s = (A1, t1), (A2, t2), …, (An, tn)
    • Ai ∈ R and Ts ≤ ti < Te for all i = 1, …, n
• Example alarm data sequence:
  – A, B, C, and D are event (or here: alarm) types
  – 10…150 are occurrence times
  – s = (D, 10), (C, 20), …, (A, 150)
  – Ts (starting time) = 10 and Te (ending time) = 150
• Note: there need not be events in every time slot!
Basics
• Episodes:
  – An episode is a pair (V, ≤)
    • V is a collection of event types, e.g., alarm types
    • ≤ is a partial order on V
  – Given a sequence S of alarms, an episode α = (V, ≤) occurs within S if there is a way of satisfying the event types (e.g., alarm types) in V using the alarms of S so that the partial order ≤ is respected
  – Intuitively: episodes consist of alarms that have certain properties and occur in a certain partial order
Basics
• The most useful partial orders are:
  – Total orders
    • The predicates of each episode have a fixed order
    • Such episodes are called serial (or "ordered")
  – Trivial partial orders
    • The order of predicates is not considered
    • Such episodes are called parallel (or "unordered")
• Complicated?
  – Not really, let's take some clarifying examples
Basics
• Examples:
[Figure: three example episodes: a serial episode (A followed by B), a parallel episode (A and B in either order), and a more complex episode combining serial and parallel parts over A, B, and C]
WINEPI Approach
• The name of the WINEPI method comes from the technique it uses: a sliding window
• Intuitively:
  – A window is slid through the event-based data sequence
  – Each window "snapshot" is like a row in a database
  – The collection of these "snapshots" forms the rows in the database
• Complicated?
  – Not really, let's take a clarifying example
WINEPI Approach
• Example alarm data sequence:
  0   10   20   30   40   50   60   70   80   90
      D    C    A    B    D    A    B    C
• The window width is 40 seconds, last point excluded
• The first/last window contains only the first/last event
WINEPI Approach
• Formally, given a set E of event types, an event sequence S = (s, Ts, Te) is an ordered sequence of events event_i = (A_i, t_i) such that t_i ≤ t_(i+1) for all i = 1, …, n−1, and Ts ≤ t_i < Te for all i = 1, …, n
[Figure: timeline from Ts to Te with events event_1 … event_n at times t_1 … t_n]
WINEPI Approach
• Formally, a window on event sequence S is an event sequence w = (w, ts, te), where ts < Te, te > Ts, and w consists of those pairs (event, t) from s where ts ≤ t < te
• The value te − ts is called the window width, W
[Figure: the same timeline with one window of width W spanning from ts to te]
WINEPI Approach
• By definition, the first and the last windows on a sequence extend outside the sequence, so that the first window contains only the first time point of the sequence, and the last window only the last time point
[Figure: the timeline with the extreme windows extending beyond Ts and Te]
WINEPI Approach
• The frequency (cf. support with association rules) of an episode α is the fraction of windows in which the episode occurs, i.e.,
  fr(α, S, W) = |{Sw ∈ W(S, W) : α occurs in Sw}| / |W(S, W)|
  where W(S, W) is the set of all windows Sw of sequence S such that the window width is W
WINEPI Approach
• When searching for the episodes, a frequency threshold (cf. support threshold with association rules) min_fr is used
• An episode α is frequent if fr(α, s, win) ≥ min_fr, i.e., "if the frequency of α exceeds the minimum frequency threshold within the data sequence s and with window width win"
• F(s, win, min_fr): the collection of frequent episodes in s with respect to win and min_fr
• Apriori trick holds: if an episode is frequent in an event sequence s, then all of its subepisodes are frequent
WINEPI Approach
• Formally, an episode rule is an expression β ⇒ α, where β and α are episodes such that β is a subepisode of α
• An episode β is a subepisode of α (β ⪯ α) if the graph representation of β is a subgraph of the representation of α
[Figure: example rule, with β a serial episode on A and B, and α the same episode extended with C]
WINEPI Approach
• The fraction
  conf(β ⇒ α) = fr(α, S, W) / fr(β, S, W)
  i.e., the frequency of the whole episode divided by the frequency of the LHS episode, is the confidence of the WINEPI episode rule
• The confidence can be interpreted as the conditional probability of the whole of α occurring in a window, given that β occurs in it
WINEPI Approach
• Intuitively:
  – WINEPI rules are like association rules, but with an additional time aspect: if events (alarms) satisfying the rule antecedent (left-hand side) occur in the right order within W time units, then also the rule consequent (right-hand side) occurs, in the location described by α, within the same W time units
WINEPI Algorithm
• Input: a set R of event/alarm types, an event sequence s over R, a set E of episodes, a window width win, and a frequency threshold min_fr
• Output: the collection F(s, win, min_fr)
• Method:
  1. compute C1 := {α ∈ E : |α| = 1};
  2. i := 1;
  3. while Ci ≠ ∅ do
  4.   (* compute Fi(s, win, min_fr) := {α ∈ Ci : fr(α, s, win) ≥ min_fr};
  5.   i := i + 1;
  6.   (** compute Ci := {α ∈ E : |α| = i, and β ∈ F|β|(s, win, min_fr) for all subepisodes β of α};
  (* = database pass, (** = candidate generation
WINEPI Algorithm
• First problem: given a sequence and an episode, find out whether the episode occurs in the sequence
• Finding the number of windows containing an occurrence of the episode can be reduced to this
• Successive windows have a lot in common; how to use this?
  – An incremental algorithm
  – Same idea as for association rules
  – A candidate episode has to be a combination of two episodes of smaller size
  – Parallel episodes, serial episodes
WINEPI Algorithm
• Parallel episodes:
  – For each candidate α, maintain a counter α.event_count: how many events of α are present in the window
  – When α.event_count becomes equal to |α|, indicating that α is entirely included in the window, save the starting time of the window in α.inwindow
  – When α.event_count decreases again, increase the field α.freq_count by the number of windows in which α remained entirely included
• Serial episodes: use state automata
WINEPI Approach
• Example alarm data sequence:
  0   10   20   30   40   50   60   70   80   90
      D    C    A    B    D    A    B    C
• The window width is 40 secs, movement step 10 secs
• The length of the sequence is 70 secs (10–80)
WINEPI Approach
• By sliding the window, we'll get 11 windows (U1–U11):
  0   10   20   30   40   50   60   70   80   90
      D    C    A    B    D    A    B    C
• The frequency threshold is set to 40%, i.e., an episode has to occur in at least 5 of the 11 windows
WINEPI Approach
• Suppose that the task is to find all parallel episodes:
  – First, create singletons, i.e., parallel episodes of size 1 (A, B, C, D)
  – Then, recognize the frequent singletons (here, all are)
  – From those frequent episodes, build candidate episodes of size 2: AB, AC, AD, BC, BD, CD
  – Then, recognize the frequent parallel episodes (here, all are)
  – From those frequent episodes, build candidate episodes of size 3: ABC, ABD, ACD, BCD
  – When recognizing the frequent episodes, only ABD occurs in more than four windows
  – There are no candidate episodes of size four
Mika Klemettinen and Pirjo MoenMika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001University of Helsinki/Dept of CS Autumn 2001
• Episode frequencies and example rules with WINEPI:
– Singletons: D: 73%, C: 73%, A: 64%, B: 64%
– Pairs: D A: 55%, D C: 45%, D B: 45%, C A: 45%, C B: 45%, A B: 45%
– Triple: D A B: 45%
– Rules: D ⇒ A [40] (55%, 75%); D A ⇒ B [40] (45%, 82%)
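The rule confidences follow from the frequencies by division. A minimal sketch using the slides' rounded percentages (the string keys are just shorthand for the parallel episodes):

```python
# Episode frequencies from the slide, as fractions of the 11 windows (rounded)
fr = {"D": 0.73, "DA": 0.55, "DAB": 0.45}

def confidence(beta, alpha, fr):
    """conf(beta => alpha) = fr(alpha) / fr(beta), beta a subepisode of alpha."""
    return fr[alpha] / fr[beta]

print(round(100 * confidence("D", "DA", fr)))    # D => A: 75 (%)
print(round(100 * confidence("DA", "DAB", fr)))  # D A => B: 82 (%)
```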
• Experiments:
– Parallel and serial episodes
– Window widths (W) 10-120 seconds
– Window movement = W/10
– min_fr = 0.003 (0.3%), frequent: about 100 occurrences
– 90 MHz Pentium, 32MB memory, Linux operating system. The data resided in a 3.0 MB flat text file
• One shortcoming of the WINEPI approach:
– Consider that two alarms of type A and one alarm of type B occur in a window
– Does the parallel episode consisting of A and B appear once or twice?
– If once, then with which alarm of type A?
[Figure: the same event sequence D C A B D A B C on the 0–90 time axis]
MINEPI Approach
• Alternative approach to discovery of episodes:
– No sliding windows
– For each potentially interesting episode, find out the exact occurrences of the episode
• Advantages: easy to modify time limits, several time limits for one rule:
"If A and B occur within 15 seconds, then C follows within 30 seconds"
• Disadvantages: uses a lot of space
• Formally, given an episode α and an event sequence S, the interval [ts,te] is a minimal occurrence of α in S,
– if α occurs in the window corresponding to the interval [ts,te], and
– if α does not occur in any proper subinterval of [ts,te]
• The set of minimal occurrences of an episode α in a given event sequence is denoted by mo(α):
mo(α) = { [ts,te] | [ts,te] is a minimal occurrence of α }
• Example: the parallel episode consisting of event types A and B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}; the composite episode (A and B in parallel, followed by C) has one minimal occurrence in s: {[60,80]}
[Figure: event sequence D C A B D A B C at times 10–80 on a 0–90 axis]
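Finding minimal occurrences of a parallel episode can be sketched directly (an illustration, not the MINEPI implementation): grow a window from each event until all episode types are covered, then discard any interval that properly contains another.

```python
# Event sequence from the slides: (time, event type)
events = [(10, "D"), (20, "C"), (30, "A"), (40, "B"),
          (50, "D"), (60, "A"), (70, "B"), (80, "C")]

def minimal_occurrences(events, episode):
    """Intervals [ts, te] in which every type of the parallel `episode`
    occurs, such that no proper subinterval also contains them all."""
    occs, n = [], len(events)
    for i in range(n):
        need = set(episode)
        for j in range(i, n):
            need.discard(events[j][1])
            if not need:                       # all episode types seen
                occs.append((events[i][0], events[j][0]))
                break
    # Keep only minimal intervals: drop any interval containing another one
    return sorted({iv for iv in occs
                   if not any(o != iv and iv[0] <= o[0] and o[1] <= iv[1]
                              for o in occs)})
```

For the episode {A, B} this returns [(30, 40), (40, 60), (60, 70)], matching the slide.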
• Informally, a MINEPI episode rule gives the conditional probability that a certain combination of events (alarms) occurs within some time bound, given that another combination of events (alarms) has occurred within a time bound
• Formally, an episode rule is β [win1] ⇒ α [win2]
• α and β are episodes such that β is a subepisode of α
• If episode β has a minimal occurrence at interval [ts,te] with te - ts ≤ win1, then episode α occurs at interval [ts,t'e] for some t'e such that t'e - ts ≤ win2
Pattern Discovery
1. Choose the language (formalism) to represent the patterns (search space)
2. Choose the rating for patterns, to tell which is “better” than others
3. Design an algorithm that finds the best patterns from the pattern class, fast.
Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.
[Figure: six levels of DNA structure, from Level 0 (the sequence ATCGCTGAATTCCAATGTG) up to Level 6]
The eukaryotic genome can be thought of as six levels of DNA structure. The loops at Level 4 range from 0.5 kb to 100 kb in length. If these loops were stabilized, then the genes inside the loop would not be expressed.
DNA: GenBank / EMBL Bank
Protein: SwissProt / TrEMBL
Structure: PDB / Molecular Structure Database
DNA determines function (?)
4 nucleotides → 20+ amino acids (3 nt → 1 AA) → function?
A Simple Gene
[Figure: DNA sequence ATCGAAATTAGCTTTA with upstream/promoter and downstream regions, gene parts A, B, C, plus modifications]
Species and individuals
• Animals, plants, fungi, bacteria, …
• Species
• Individuals
www.tolweb.org
Gene Regulatory Signal Finding
Transcription Factor
Transcription Factor Binding Site
Goal: Detect Transcription Factor Binding Sites. (Eleazar Eskin, Columbia Univ.)
GGTGGCAA - proteasome associated control element
Genes sharing it upstream (ORF, gene, function, description, SGD ID):
YOR261C RPN8 protein degradation; 26S proteasome regulatory subunit; S0005787
YDL020C RPN4 protein degradation, ubiquitin; 26S proteasome subunit; S0002178
YDL007W RPT2 protein degradation; 26S proteasome subunit; S0002165
YDL147W RPN5 protein degradation; 26S proteasome subunit; S0002306
YOL038W PRE6 protein degradation; 20S proteasome subunit (alpha4); S0005398
YKL145W RPT1 protein degradation, ubiquitin; 26S proteasome subunit; S0001628
YDL097C RPN6 protein degradation; 26S proteasome regulatory subunit; S0002255
YDR394W RPT3 protein degradation; 26S proteasome subunit; S0002802
YBR173C UMP1 protein degradation, ubiquitin; 20S proteasome maturation factor; S0000377
YER012W PRE1 protein degradation; 20S proteasome subunit C11 (beta4); S0000814
YPR108W RPN7 protein degradation; 26S proteasome regulatory subunit; S0006312
YOR117W RPT5 protein degradation; 26S proteasome regulatory subunit; S0005643
YJL001W PRE3 protein degradation; 20S proteasome subunit (beta1); S0003538
YPR103W PRE2 protein degradation; 20S proteasome subunit (beta5); S0006307
YOR157C PUP1 protein degradation; 20S proteasome subunit (beta2); S0005683
YGL048C RPT6 protein degradation; 26S proteasome regulatory subunit; S0003016
YHR200W RPN10 protein degradation; 26S proteasome subunit; S0001243
YML092C PRE8 protein degradation; 20S proteasome subunit Y7 (alpha2); S0004557
YIL075C RPN2 tRNA processing; 26S proteasome subunit; S0001337
YMR314W PRE5 protein degradation; 20S proteasome subunit (alpha6); S0004931
YGR253C PUP2 protein degradation; 20S proteasome subunit (alpha5); S0003485
YGR135W PRE9 protein degradation; 20S proteasome subunit Y13 (alpha3); S0003367
YFR004W RPN11 transcription; putative global regulator; S0001900
YOR259C RPT4 protein degradation; 26S proteasome regulatory subunit; S0005785
YFR052W RPN12 protein degradation; 26S proteasome regulatory subunit; S0001948
YFR050C PRE4 protein degradation; proteasome subunit, B type; S0001946
YGL011C SCL1 protein degradation; 20S proteasome subunit YC7ALPHA/Y8; S0002979
YDR427W RPN9 protein degradation; 26S proteasome regulatory subunit; S0002835
YOR362C PRE10 protein degradation; 20S proteasome subunit C1 (alpha7); S0005889
YBL041W PRE7 protein degradation; 20S proteasome subunit; S0000137
YER021W RPN3 protein degradation; 26S proteasome regulatory subunit; S0000823
YER094C PUP3 protein degradation; 20S proteasome subunit (beta3); S0000896
YGR270W YTA7 protein degradation; 26S proteasome subunit, ATPase; S0003502
YHR027C RPN1 protein degradation; 26S proteasome regulatory subunit; S0001069
YER047C SAP1 mating type switching; AAA family protein; S0000849
YGR232W unknown; unknown; S0003464
SPEXS: count and memorize
i...v....x....v....x
abracadabradadabraca
a: occurrences at {1,4,6,8,11,13,15,18,20}; positions after each occurrence: {2,5,7,9,12,14,16,19,21}
SPEXS: extend …
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
→ b: {2,9,16}
→ c: {5,19}
→ d: {7,12,14}
SPEXS: find frequent first
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
→ b: {2,9,16}
→ d: {7,12,14}
(c: {5,19} is below the frequency threshold)
SPEXS: group positions
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
→ b: {2,9,16}
→ d: {7,12,14}
→ [bd]: {2,7,9,12,14,16}
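The grouping step above is just a union of the extensions' position sets, as a one-line sketch:

```python
# SPEXS "group positions": the extensions b and d of pattern 'a' are merged
# into the character class [bd] by taking the union of their position sets.
pos = {"b": {2, 9, 16}, "d": {7, 12, 14}}

group = sorted(pos["b"] | pos["d"])   # positions matched by a[bd]
print(group)                          # -> [2, 7, 9, 12, 14, 16]
```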
The wildcards
GCAT.{3,6}X
GCAT.*X
The wildcards: not too many
[Figure: pattern tree annotated with wildcard counts, e.g. a (w:0), .{3,6}b (w:1); the number of wildcards per pattern is bounded]
SPEXS: general algorithm
1. S = input sequences ( ||S|| = n )
2. e = empty pattern, e.pos = {1,...,n}
3. enqueue( order, e )
4. while p = dequeue( order )
5.   generate all allowed extensions p' of p (and p'.pos)
6.   enqueue( order, p', priority(p') )
7.   enqueue( output, p', fitness(p') )
8. while p = dequeue( output )
9.   output p
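The pseudocode above can be sketched in Python for plain substring patterns over a single sequence. A FIFO queue stands in for the priority queue, and each pattern carries, as in the slides, the set of positions just past its occurrences:

```python
from collections import deque

def spexs(s, min_count):
    """Breadth-first SPEXS-style search for frequent substring patterns of s.
    For each pattern we keep the set of positions just past its occurrences."""
    results = {}
    queue = deque([("", set(range(1, len(s) + 1)))])  # empty pattern, 1-based
    while queue:
        pat, pos = queue.popleft()
        # Group the characters found at each position: one-character extensions
        ext = {}
        for p in pos:
            if p <= len(s):
                ext.setdefault(s[p - 1], set()).add(p + 1)
        for ch, newpos in sorted(ext.items()):
            if len(newpos) >= min_count:       # only extend frequent patterns
                results[pat + ch] = len(newpos)
                queue.append((pat + ch, newpos))
    return results

counts = spexs("abracadabradadabraca", 2)
print(counts["a"], counts["ab"], counts["abrac"])   # -> 9 3 2
```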
Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (20 pp), May 1998. Department of Computer Science, University of Helsinki.
Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3, Helsinki, November 2002, 149 pages.
• “Lazy suffix tree construction”‐like algorithm (Kurtz, Giegerich)
• Analyze multiple sets of sequences simultaneously
• Restrict search to most frequent patterns only (in each set)
• Report most frequent patterns, patterns over‐ or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
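Significance by the binomial distribution can be computed directly. A sketch; the counts in the example call are illustrative, not taken from an actual run:

```python
from math import comb

def binomial_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability of the pattern
    matching at least k of n sequences if each sequence matches with the
    background probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Illustrative: pattern in 39 of 200 cluster sequences, background rate 193/6000
pv = binomial_pvalue(39, 200, 193 / 6000)
```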
Multiple data sets
[Figure: a pattern's occurrence counts in three data sets D1, D2, D3: 4/3 (6), 3/3 (12), 2/2 (9)]
.G.GATGAG.T. 39 seq (vs 193) p = 2.5e-33
-1: .G.GATGAG.T. 61 seq (vs 1292) p = 1.4e-19
-2: .G.GATGAG.T. 91 seq (vs 5464)
-3: .G.GATGAG.T. 98 seq
-2: .G.GATGAG.T. 91 seq
These hits result in a PWM:
[Figure: PWM based on all previous hits; highest-scoring occurrences shown in blue]
All against all approximate matching
For every subsequence of every sequence, match approximately against all the sequences.
Approximate hits define PWM matrices (not all positions vary equally).
Look for ALL PWMs derived from the data that are enriched in the data set (vs. background).
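Turning aligned approximate hits into a PWM and scoring candidate sites with it can be sketched as follows; the hit sequences here are made up for illustration:

```python
from math import log2

# Aligned approximate hits (illustrative, not real data)
hits = ["GGATGAGT", "GGATGAGA", "GCATGAGT", "GGATGACT"]

def pwm(hits, alphabet="ACGT", pseudo=0.5):
    """Per-position log-odds scores against a uniform background; the
    pseudocount keeps unseen letters from scoring minus infinity."""
    n = len(hits)
    matrix = []
    for column in zip(*hits):            # one column of the alignment at a time
        scores = {}
        for a in alphabet:
            freq = (column.count(a) + pseudo) / (n + pseudo * len(alphabet))
            scores[a] = log2(freq / 0.25)
        matrix.append(scores)
    return matrix

def score(matrix, site):
    """Log-odds score of a candidate site of the same length as the PWM."""
    return sum(col[c] for col, c in zip(matrix, site))
```

Because not all positions vary equally, conserved columns contribute large positive scores while variable columns contribute little, which is exactly why approximate hits are summarized as a matrix rather than a single consensus string.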
Hendrik Nigul, Jaak Vilo
Dynamic programming
• A small number of allowed edit operations makes it possible to limit the search efficiently to a band around the main diagonal
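A sketch of this banded dynamic programming: only cells within k of the main diagonal are filled, so the cost is O(k·n) instead of O(n·m).

```python
def banded_edit_distance(a, b, k):
    """Edit distance computed only inside a band of width k around the main
    diagonal; returns the distance if it is <= k, otherwise None."""
    INF = k + 1                       # any value > k acts as "outside the band"
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        cur[0] = i if i <= k else INF
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # (mis)match
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None
```

Any alignment path leaving the band costs more than k, so the result is exact whenever the true distance is at most k; otherwise None is returned.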