DataMining MTAT.03.183 (4AP 6EAP) Streams, time series · PDF filetechniques can be used to learn this prediction task from labeled ... • Built in D2K as D2K modules and leveraged

25.11.2009

1

Data Mining MTAT.03.183

(4AP = 6EAP)Streams, time series,

Jaak Vilo

2009 Fall

Summary so far

• Data preparation

• Machine learning

• Statistics/significance

• Large data – algorithmics

• Visualisation

• Queries/reporting, OLAP

• Different types of data

• Business valueJaak Vilo and other authors UT: Data Mining 2009 2

Streams, time series

• Time

• Sequence order and position

• Continuosly arriving data

Jaak Vilo and other authors UT: Data Mining 2009 3

Wikipedia

• Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.


• In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from q plabeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.


Software

• RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining learning time‐varying concepts andmining, learning time varying concepts, and tracking drifting concept (if used in combination with its data stream mining plugin (formerly: concept drift plugin))


25.11.2009

2

• MOA (Massive Online Analysis): free open‐source software specific for mining datastreams with concept drift. It contains a prequential evaluation method the EDDMprequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators asSEA concepts, STAGGER, rotating hyperplane, random tree, and random radius basedfunctions. MOA supports bi‐directionalinteraction with Weka (machine learning).


Literature on Stream Mining

• http://www.csse.monash.edu.au/~mgaber/WResources.htm


Mining Data Streams

What is stream data? Why Stream Data Systems?

Stream data management systems: Issues and solutions

Stream data cube and multidimensional OLAP analysis

November 25, 2009 Data Mining: Concepts and Techniques 9

Stream frequent pattern analysis

Stream classification

Stream cluster analysis

Research issues

Characteristics of Data Streams

Data Streams Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data setsdata sets

Characteristics


Huge volumes of continuous data, possibly infinite

Fast changing and requires fast, real-time response

Data stream captures nicely our data processing needs of today Random access is expensive—single scan algorithm (can only have

one look)

Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processing

Stream Data Applications

Telecommunication calling records

Business: credit card transaction flows Network monitoring and traffic engineering

Financial market: stock exchange

E i i & i d t i l l &


Engineering & industrial processes: power supply & manufacturing

Sensor, monitoring & surveillance: video streams, RFIDs

Security monitoring

Web logs and Web page click streams Massive data sets (even saved but random access is too

expensive)

DBMS versus DSMS

Persistent relations

One-time queries

Random access

“Unbounded” disk store

Only current state matters

Transient streams

Continuous queries

Sequential access

Bounded main memory

Historical data is important


Only current state matters

No real-time services

Relatively low update rate

Data at any granularity

Assume precise data

Access plan determined by query processor, physical DB design

Historical data is important

Real-time requirements

Possibly multi-GB arrival rate

Data at fine granularity

Data stale/imprecise

Unpredictable/variable data arrival and characteristics

Ack. From Motwani’s PODS tutorial slides

25.11.2009

3

Mining Data Streams

What is stream data? Why Stream Data Systems?

Stream data management systems: Issues and solutions

Stream data cube and multidimensional OLAP analysis


Stream frequent pattern analysis

Stream classification

Stream cluster analysis

Research issues

Architecture: Stream Query Processing

User/ApplicationUser/Application

Continuous QueryContinuous Query

ResultsResults

SDMS (Stream Data Management System)


Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)

Stream QueryStream QueryProcessorProcessor

Multiple streamsMultiple streams

Stream Data Mining Tasks

On-Line analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams g q p Mining sequential patterns in data streams Mining partial periodicity in data streams Mining outliers and unusual patterns in data

streams ……

Clustering on Streams

K-means - not suitable for stream mining

Clustream- assume shape of the cluster is always assume shape of the cluster is always

circle. Denstream

- detects arbitrary shape clusters in stream data.

Frequent Pattern Mining (FPM) in data streams.Frequent (/hot/top) patterns:Items/Item sets/Sequences occurring, frequently in a database.

ISSUES

Frequent Pattern Mining (FPM)in data streams

-Limited memory

-Reading past data is impossible.

Question: How much is it justified to mine frequent pattern only in data stream??

Infrequent pattern mining

Objective:1. To find-out the abnormality , surprising or

“interesting” pattern in the data stream.2. Mutual pattern mining.3 Stream specific item set mining3. Stream specific item set mining.4. Association Rule mining among event of interest.

Application:1. Text mining.2. Distributed Sensor Networks.3. Works well for evolving data stream.

25.11.2009

4

Challenges in Stream Data Analysis

• Data Volume is Huge• Need to remember recent and historical data• Approaches to data reduction• Need single linear scan algorithms• Most existing algorithms and prototype systems are

memory and CPU bound and can only perform a single memory and CPU bound, and can only perform a single data mining function

• Desire to perform multiple analysis at the same time• Occurrence of concept drifts where previous model is

no longer valid• Reduce the cost of learning where models need to be

updated and replaced• Require instant response

Loretta Auvil

Stream Data Reduction

• Challenges of “OLAP-ing” stream data• Raw data cannot be stored• Simple aggregates are not powerful enough• History shape and patterns at different levels are desirable

• MAIDS Unique Approach• A tilted time window to aggregate data at different points

in time• A scalable multi-dimensional stream data cube that can

aggregate a model of stream data efficiently without accessing the raw data

Loretta Auvil

MAIDS Approach: Tilted Time Window

• Recent data is registered and weighted at a finer granularity than longer term data

• As the edge of a time widow is reached, the finer granularity data is summarized and propagated to a courser granularity

• Window is maintained automatically

24hrs 4qtrs 15minutes7days

Past

30sec

TimePresent

Loretta Auvil

MAIDS: Stream Mining Architecture

MAIDS is aimed to:• Discover changes,

trends and evolution characteristics in data streams

• Construct clusters and classification models from data streams

• Explore frequent patterns and similarities among data streams

Loretta Auvil

Features of MAIDS

• General purpose tool for data stream analysis• Processes high-rate and multi-dimensional data• Adopts a flexible tilted time window framework• Facilitates multi-dimensional analysis using a stream

cube architecturecube architecture• Integrates multiple data mining functions• Provides user-friendly interface: automatic analysis and

on-demand analysis• Facilitates setting alarms for monitoring• Built in D2K as D2K modules and leveraged in the D2K

Streamline tool

Loretta Auvil

Statistics Query Engine

• Answers user queries on data statistics, such as, count, max, min, average, regression, etc.

U tilt d ti • Uses tilted time window

• Uses an efficient data structure, H-tree for partial computation of data cubes

Loretta Auvil

25.11.2009

5

Stream Data Classifier

• Builds models to make predictions

• Uses Naïve Bayesian Classifier with boosting

• Uses Tilted Time Window to track time related info

• Sets alarm to monitor events

Loretta Auvil

Stream Pattern Finder

• Find frequent patterns with multiple time granularities

• Keep precise/ compressed history in tilted time windowtilted time window

• Mine only the interested item set using FP-tree algorithm

• Mining evolution and dramatic changes of frequent patterns

Loretta Auvil

Stream Data Clustering

• Two stages: micro-clustering and macro-clustering

• Uses micro-clustering to do incremental, online processing and online processing and maintenance

• Uses tilted time frame

• Detects outliers when new clusters are formed

Loretta Auvil

Demonstration

Loretta Auvil

Significant Advances In the Areas of Data Management and Mining

• Tilted-time window for multi-resolution modeling• Multi-dimensional analysis using a stream cube architecture• Efficient “one-look” stream data mining algorithms:

• classification, frequent pattern analysis, clustering, and information visualization

• Integration of “one-look” approaches into one stream data mining platform so they can cooperate to discover patterns and surprising platform so they can cooperate to discover patterns and surprising events in real-time

• Internationally recognized research leadership in the areas of data management, mining, and knowledge sharing

• Experience in development of robust software framework supporting advanced, data mining and information visualization

• Experience in development of software environments supporting problem solving and evidence-based decision making

Loretta Auvil

Knowledge Extraction from Streaming Text

Information extraction• process of using advanced

automated machine learning approaches

• to identify entities in text documents

• extract this information along with the relationships these pentities may have in the text documents

This project demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.

Loretta Auvil

25.11.2009

6

Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, ordered streams

Main memory computations

Queries are often continuous Evaluated continuously as stream data arrives

Ans e pdated o e time


Answer updated over time

Queries are often complex Beyond element-at-a-time processing

Beyond stream-at-a-time processing

Beyond relational queries (scientific, data mining, OLAP)

Multi-level/multi-dimensional processing and data mining Most stream data are at low-level or multi-dimensional in nature

Processing Stream Queries

Query types One-time query vs. continuous query (being evaluated

continuously as stream continues to arrive) Predefined query vs. ad-hoc query (issued on-line)

Unbounded memory requirements


For real-time response, main memory algorithm should be used

Memory requirement is unbounded if one will join future tuples

Approximate query answering With bounded memory, it is not always possible to produce exact

answers

High-quality approximate answers are desired Data reduction and synopsis construction methods

Sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing

Major challenges Keep track of a large universe, e.g., pairs of IP address, not ages

Methodology Synopses (trade-off between accuracy and storage) Use synopsis data structure, much smaller (O(logk N) space) than

their base data set (O(N) space)


Compute an approximate answer within a small error range(factor ε of the actual answer)

Major methods Random sampling Histograms Sliding windows Multi-resolution model Sketches Radomized algorithms

Stream Data Processing Methods (1)

Random sampling (but without knowing the total length in advance)

Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the element seen so far in the stream. As the data stream flow, every new element has a certain probability (s/N) of replacing an old element in the reservoir.

Sliding windows

M k d i i b d l t d t f lidi i d i


Make decisions based only on recent data of sliding window size w An element arriving at time t expires at time t + w

Histograms

Approximate the frequency distribution of element values in a stream

Partition data into a set of contiguous buckets

Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)

Multi-resolution models

Popular models: balanced binary trees, micro-clusters, and wavelets

Stream Data Processing Methods (2) Sketches

Histograms and wavelets require multi-passes over the data but sketches can operate in a single pass

Frequency moments of a stream A = {a1, …, aN}, Fk:where v: the universe or domain size, mi: the frequency of i in the sequence

Given N elts and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space

v

i

kik mF

1


O(log v + log N) space

Randomized algorithms

Monte Carlo algorithm: bound on running time but may not return correct result

Chebyshev’s inequality:

Let X be a random variable with mean μ and standard deviation σ

Chernoff bound:

Let X be the sum of independent Poisson trials X1, …, Xn, δ in (0, 1]

The probability decreases expoentially as we move from the mean

2

2

)|(|k

kXP

4/2

|])1([ eXP

Approximate Query Answering in Streams

Sliding windows Only over sliding windows of recent stream data Approximation but often more desirable in applications

Batched processing, sampling and synopses Batched if update is fast but computing is slow

Comp te pe iodicall not e timel


Compute periodically, not very timely Sampling if update is slow but computing is fast

Compute using sample data, but not good for joins, etc. Synopsis data structures

Maintain a small synopsis or sketch of data Good for querying historical data

Blocking operators, e.g., sorting, avg, min, etc. Blocking if unable to produce the first output until seeing the entire

input

25.11.2009

7

Projects on DSMS (Data Stream Management System)

Research projects and system prototypes

STREAMSTREAM (Stanford): A general-purpose DSMS

CougarCougar (Cornell): sensors

AuroraAurora (Brown/MIT): sensor monitoring, dataflow

Hancock Hancock (AT&T): telecom streams


NiagaraNiagara (OGI/Wisconsin): Internet XML databases

OpenCQOpenCQ (Georgia Tech): triggers, incr. view maintenance

TapestryTapestry (Xerox): pub/sub content-based filtering

TelegraphTelegraph (Berkeley): adaptive engine for sensors

TradebotTradebot (www.tradebot.com): stock tickers & streams

TribecaTribeca (Bellcore): network monitoring

MAIDS MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams

Stream Data Mining vs. Stream Querying

Stream mining—A more challenging task in many cases It shares most of the difficulties with stream querying

But often requires less “precision”, e.g., no join, grouping, sorting

Patterns are hidden and more general than querying


It may require exploratory analysis Not necessarily continuous queries

Stream data mining tasks Multi-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream data Clustering data streams Classification of stream data

Concept drift

• In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time i e the goal of the prediction the classtime, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.


Episode Rules

• Association rules applied to sequences of events.

• Episode – set of event predicates and partial ordering on them

© Prentice Hall 40

ordering on them

Mika Klemettinen and Pirjo MoenMika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001University of Helsinki/Dept of CS Autumn 2001

•• Association rules describe how things occur together in Association rules describe how things occur together in the datathe data

– E.g., "IF an alarm has certain properties, THEN it will have other given properties"

BasicsBasics

Course on Data MiningCourse on Data Mining 41Page41/54

have other given properties

•• Episode rules describe temporal relationships between Episode rules describe temporal relationships between thingsthings

– E.g., "IF a certain combination of alarms occurs within a time period, THEN another combination of alarms will occur within a time period"


Network Management SystemNetwork Management System

MSC MSCMSCMSC

Switched NetworkSwitched Network

BasicsBasics


BSC BSCBSCBSC

BTSBTS BTSBTSBTSBTS

Access NetworkAccess Network

MSCMSC

BSCBSC

BTSBTS

Base station controller

Base station transceiver

Mobile station controller

Ala

rms

25.11.2009

8


BasicsBasics

•• As defined earlier, telecom data contains alarms:As defined earlier, telecom data contains alarms:

1234 EL1 PCM 940926082623 A1 ALARMTEXT..

Alarm type Date, time Alarm severity class


•• Now we forget about relationships between attributes Now we forget about relationships between attributes within alarms as with the association ruleswithin alarms as with the association rules

•• We just take the alarm number attribute, handle it here We just take the alarm number attribute, handle it here as event/alarm type and inspect the relationships as event/alarm type and inspect the relationships between events/alarmsbetween events/alarms

Alarm numberAlarming network element

Episodes

• Partially ordered set of pages

• Serial episode – totally ordered with time constraint

ll l i d i l d d i h i

© Prentice Hall 44

• Parallel episode – partial ordered with time constraint

• General episode – partial ordered with no time constraint

DAG for Episode

© Prentice Hall 45


•• Data:Data:– Data is a set R of events– Every event is a pair (A, t), where

• A R is the event type (e g alarm type)

BasicsBasics


• A R is the event type (e.g., alarm type)• t is an integer, the occurrence time of the event

– Event sequence s on R is a triple (s, Ts, Te)• Ts is starting time and Te is ending time• Ts < Te are integers• s = (A1, t1), (A2, t2), …, (An, tn) • Ai R and Ts ti < Te for all i=1, …, n


•• Example alarm data sequence:Example alarm data sequence:

BasicsBasics

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150

D C A B D A B C A D C A B D A


•• Here:Here:– A, B, C and D are event (or here alarm) types– 10…150 are occurrence times– s = (D, 10), (C, 20), …, (A, 150) – Ts (starting time) = 10 and Te (ending time) = 150

•• Note: There needs Note: There needs notnot to be events on every time slot!to be events on every time slot!


•• Episodes:Episodes:

– An episode is a pair (V, )

• V is a collection of event types, e.g., alarm types

BasicsBasics


• is a partial order on V

– Given a sequence S of alarms, an episode = (V, )occurs within S if there is a way of satisfying the event types (e.g., alarm types) in V using the alarms of S so that the partial order is respected

– Intuitively: episodes consist of alarms that have certain properties and occur in a certain partial order

25.11.2009

9


•• The most useful partial orders are:The most useful partial orders are:

– Total orders

• The predicates of each episode have a fixed order

h i d ll d l ( d d )

BasicsBasics


• Such episodes are called serial (or "ordered")

– Trivial partial orders

• The order of predicates is not considered

• Such episodes are called parallel (or "unordered")

•• Complicated?Complicated?

– Not really, let's take some clarifying examples


•• Examples:Examples:

BasicsBasics

A B A A


Serial episode

B

Parallel episode

B

C

More complex episode with

serial and parallel


•• The name of the WINEPI method comes from the The name of the WINEPI method comes from the technique it uses: a sliding windowtechnique it uses: a sliding window

•• Intuitively: Intuitively:

– A window is slided through the event-based data

WINEPI ApproachWINEPI Approach


A window is slided through the event-based data sequence

– Each window "snapshot" is like a row in a database

– The collection of these "snapshots" forms the rows in the database

•• Complicated?Complicated?

– Not really, let's take a clarifying example





0 10 20 30 40 50 60 70 80 90

D C A B D A B C

•• The window width is 40 seconds, last point excluded The window width is 40 seconds, last point excluded •• The first/last window contains only the first/last eventThe first/last window contains only the first/last event


• Formally, given a set E of event types an event sequence an event sequence S S = (= (ss,,TTss,T,Tee)) is an ordered sequence of events eventi such that eventi eventi+1 for all i=1, …, n-1, and Ts eventi < Te for all i=1, …, n



e , ,

Ts Te

t1 t2 t3 … … tn

event1 event2 event3 … … eventn


• Formally, a windowwindow on event sequence S is an event sequence S=(w,ts,te), where ts < Te, te > Ts, and w consists of those pairs (event, t) from s where ts t < te

• The value t t < t is called window width W



• The value ts t < te is called window width, W

Ts Te

t1 t2 t3 tn


WWttss ttee

25.11.2009

10


• By definition, the first and the last windows on a sequence extend outside the sequence, so that the last window contains only the first time point of the sequence, and the last window only the last time point



y p

WWttss ttee

Ts Te

t1 t2 t3 tn


WWttss ttee


• The frequencyfrequency (cf. support with association rules) of an episode is the fraction of windows in which the episode occurs, i.e.,



|Sw W(S, W) | occurs in Sw |fr(, S, W) =

|W(S, W)|

where W(S, W) is the set of all windows Sw of sequence S such that the window width is W


• When searching for the episodes, a frequency thresholdfrequency threshold(cf. support threshold with association rules) min_fr is used

• Episode is frequent if fr(, s, win) min_fr, i.e, "if the freq enc of e ceeds the minim m freq enc threshold



frequency of exceeds the minimum frequency threshold within the data sequence s and with window width win"

• F(s, win, min_fr): a collection of frequent episodes in swith respect to win and min_fr

•• Apriori trick holds:Apriori trick holds: if an episode is frequent in an event sequence s, then all subepisodes are frequent


•• FormallyFormally, an episode rule is as expression , where and are episodes such that is a subepisode of

• An episode is a subepisode of ( ), if the graph representation is a subgraph of the representation of



representation is a subgraph of the representation of

A

B

:

A

B

C:


• The fraction

fr(, S, W) = frequency of the whole episode

fr( S W) = frequency of the LHS episode



fr(, S, W) frequency of the LHS episode

is the confidenceconfidence of the WINEPI episode rule

• The confidence can be interpreted as the conditional probability of the whole of occurring in a window, given that occurs in it


•• Intuitively: Intuitively:

– WINEPI rules are like association rules, but with an additional time aspect:

If e ents (alarms) satisf ing the r le antecedent (left



If events (alarms) satisfying the rule antecedent (left-hand side) occur in the right order within W time units, then also the rule consequent (right-hand side) occurs in the location described by , also within W time units

antecedent antecedent consequent [window width] (f, c)consequent [window width] (f, c)

25.11.2009

11


•• InputInput: A set R of event/alarmtypes, an event sequence s over R, a set Eof episodes, a window width win, and a frequency threshold min_fr

•• OutputOutput: The collection F(s, win, min_fr)•• MethodMethod:

1. compute C1 := { E | || = 1};

WINEPI AlgorithmWINEPI Algorithm


p 1 { | | | };2. i := 1;3. while Ci do4.(* compute F(s, win, min_fr) := { Ci | fr(, s, win) min_fr};5. i := l+1;6.(** compute Ci:= { E | || = I, and F||(s, win, min_fr) for

all E, };

(* = database pass, (** candidate generation


• First problem: given a sequence and a episode, find out whether the episode occurs in the sequence

• Finding the number of windows containing an occurrence of the episode can be reduced to this



• Successive windows have a lot in common• How to use this?

– An incremental algorithm– Same idea as for association rules– A candidate episode has to be a combination of two episodes of

smaller size– Parallel episodes, serial episodes


•• Parallel episodes:Parallel episodes:

– For each candidate maintain a counter .event_count: how many events of are present in the window

When t t becomes eq al to || indicating



– When .event_count becomes equal to ||, indicating that is entirely included in the window, save the starting time of the window in .inwindow

– When .event_count decreases again, increase the field .freq_count by the number of windows where remainded entirely in the window

•• Serial episodes: use a state automataSerial episodes: use a state automata





0 10 20 30 40 50 60 70 80 90

D C A B D A B C

•• The window width is 40 secs, The window width is 40 secs, movement stepmovement step 10 secs 10 secs •• The length of the sequence is 70 secs (10The length of the sequence is 70 secs (10--80)80)


•• By sliding the window, we'll get 11 windows (UBy sliding the window, we'll get 11 windows (U11--UU1111): ):


U2

…


0 10 20 30 40 50 60 70 80 90

D C A B D A B C

•• Frequency threshold is set to 40%, i.e., an episode has Frequency threshold is set to 40%, i.e., an episode has to occur at least in 5 of the 11 windowsto occur at least in 5 of the 11 windows

U1

U2U11




25.11.2009

12


•• Suppose that the task is to find all parallel episodes:Suppose that the task is to find all parallel episodes:– First, create singletons, i.e., parallel episodes of size 1 (A, B, C, D)

– Then, recognize the frequent singletons (here all are)

– From those frequent episodes, build candidate episodes of size 2:



From those frequent episodes, build candidate episodes of size 2: AB, AC, AD, BC, BD, CD

– Then, recongize the frequent parallel episodes (here all are)

– From those frequent episodes, build candidate episodes of size 3: ABC, ABD, ACD, BCD

– When recognizing the frequent episodes, only ABD occurs in more than four windows

– There are no candidate episodes of size four


•• Episode frequencies and example rules with WINEPI:Episode frequencies and example rules with WINEPI:

D : 73%C : 73%A : 64%



B : 64% D A [40] (55%, 75%)D C : 45%D A : 55%D B : 45% D A B [40] (45%, 82%)C A : 45%C B : 45%A B : 45%D A B : 45%


•• Data:Data:

– Alarms from a telecommunication network

– 73 000 events (7 weeks), 287 event types

ll l d i l i d

WINEPI: Experimental ResultsWINEPI: Experimental Results


– Parallel and serial episodes

– Window widths (W) 10-120 seconds

– Window movement = W/10

– min_fr = 0.003 (0.3%), frequent: about 100 occurrences

– 90 MHz Pentium, 32MB memory, Linux operating system. The data resided in a 3.0 MB flat text file


WINEPI: Experimental ResultsWINEPI: Experimental Results

Window Serial episodes Parallel episodeswidth (s) #frequent time (s) #frequent time (s)10 16 31 10 8


20 31 63 17 940 57 117 33 1460 87 186 56 1580 145 271 95 21100 245 372 139 21120 359 478 189 22


•• One shortcoming in WINEPI approach:One shortcoming in WINEPI approach:

– Consider that two alarms of type A and one alarm of type B occur in a window

Does the parallel episode consisting of A and B appear



– Does the parallel episode consisting of A and B appear once or twice?

– If once, then with which alarm of type A?

0 10 20 30 40 50 60 70 80 90

D C A B D A B C


•• Alternative approach to discovery of episodesAlternative approach to discovery of episodes

– No sliding windows

– For each potentially interesting episode, find out the e act occ rrences of the episode

MINEPI ApproachMINEPI Approach


exact occurrences of the episode

•• Advantages:Advantages: easy to modify time limits, several time limits for one rule:

"If A and B occur within 15 seconds, then C follows within 30 seconds"

•• Disadvantages:Disadvantages: uses a lots of space

25.11.2009

13


• Formally, given a episode and an event sequence S, the interval [ts,te] is a minimal occurrenceminimal occurrence of S,

– If occurs in the window corresponding to the interval

If does not occ r in an proper s binter al



– If does not occur in any proper subinterval

• The set of minimal occurrencesset of minimal occurrences of an episode in a given event sequence is denoted by mo():

mo() = { [ts,te] | [ts,te] is a minimal occurrence of }


• Example: Parallel episode consisting of event types Aand B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}, has one occurrence in s: {[60,80]}



D C A B D A B C

0 10 20 30 40 50 60 70 80 90

A

B

A

BC: :


•• InformallyInformally, a MINEPI episode rule gives the conditional probability that a certain combination of events (alarms) occurs within some time bound, given that another combi-nation of events (alarms) has occurred within a time bound



( )

• Formally, an episode ruleepisode rule is [win1] [win2]

• and are episodes such that ( is a subepisode of )

• If episode has a minimal occurrence at interval [ts,te] with te - ts win1, then episode occurs at interval [ts,t'e] for some t'e such that t'e - ts win2

Pattern Discovery

1. Choose the language (formalism) to represent the patterns (search space)

2. Choose the rating for patterns, to tell which is “better” than others

3. Design an algorithm that finds the best

patterns from the pattern class, fast.

Brazma A, Jonassen I, Eidhammer I, Gilbert D.Approaches to the automatic discovery of patterns in biosequences.J Comput Biol. 1998;5(2):279-305.

Level 1

Level 2

Level 3

Eukaryotic genome can be thought of as six Levels of DNA structure.

The loops at L l 4

Level 0 ATCGCTGAATTCCAATGTG

Level 4

Level 5

Level 6

Level 4 range from 0.5kb to 100kb in length.

If these loops were stabilized then the genes inside the loop would not be expressed.

DNAGenBank / EMBL Bank

ProteinSwissProt/TrEMBL

StructurePDB/Molecular Structure Database

DNA determines function (?)

4 Nucleotides 20+ Amino Acids(3nt 1 AA)

Function?

25.11.2009

14

A Simple Gene

A: B: C:

ATCGAAATTAGCTTTA

+Modifications

Upstream/promoter

Downstream

DNA:

Species and individuals

• Animals, plantsfungi, bacteria, …

S i• Species

• Individuals

www.tolweb.org

Gene Regulatory Signal Finding

Transcription Factor

Transcription Factor Binding Site

Goal: Detect Transcription Factor Binding Sites.Eleazar Eskin: Columbia Univ.

TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCCTTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTCTTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAATGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTCTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTGTGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACTTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTAGATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGCTTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCGAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTCTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC

Patterns: AT

25.11.2009

15

Patterns: [AT][ACT]ATIUPAC: W H AT

Cluster of co‐expressed genes, pattern discovery in regulatory regions

600 basepairs

Expression profiles

Upstream regions

Retrieve

Find patterns over-represented within clusterGenome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000

Binomial or hypergeometric distribution

Background -ALL upstream

sequencesCluster:

occurs 3 times

5 out of 25, p = 0.2

P(3,6,0.2) is probabilityof having 3 matches in 6 sequences

P(,3,6,0.2)=0.0989

Pattern vs cluster “strength”

The pattern probability vs. the average silhouette for the cluster

The same for randomised clusters

Vilo et.al. ISMB 2000

The most unprobable pattern from best clusters

Pattern Probability Cluster Occurrences Total nr of Ksize in cluster occurrences in K-means

AAAATTTT 2.59E-43 96 72 830 60ACGCG 6.41E-39 96 75 1088 50ACGCGT 5.23E-38 94 52 387 40CCTCGACTAA 5.43E-38 27 18 23 220GACGCG 7.89E-31 86 40 284 38TTTCGAAACTTACAAAAAT 2.08E-29 26 14 18 450TTCTTGTCAAAAAGC 2.08E-29 26 14 18 325ACATACTATTGTTAAT 3.81E-28 22 13 18 280GATGAGATG 5.60E-28 68 24 83 84TGTTTATATTGATGGA 1.90E-27 24 13 18 220GATGGATTTCTTGTCAAAA 5.04E-27 18 12 18 500GATGGATTTCTTGTCAAAA 5.04E 27 18 12 18 500TATAAATAGAGC 1.51E-26 27 13 18 300GATTTCTTGTCAAA 3.40E-26 20 12 18 700GATGGATTTCTTG 3.40E-26 20 12 18 875GGTGGCAA 4.18E-26 40 20 96 180TTCTTGTCAAAAAGCA 5.10E-26 29 13 18 250CGAAACTTACAAA 5.10E-26 29 13 18 290GAAACTTACAAAAATAAA 7.92E-26 21 12 18 650TTTGTTTATATTG 1.74E-25 22 12 18 600ATCAACATACTATTGT 3.62E-25 23 12 18 375ATCAACATACTATTGTTA 3.62E-25 23 12 18 625GAACGCGCG 4.47E-25 20 11 13 260GTTAATTTCGAAAC 7.23E-25 24 12 18 400GGTGGCAAAA 3.37E-24 33 14 31 475ATCTTTTGTTTATATTGA 7.19E-24 19 11 18 675TTTGTTTATATTGATGGA 7.19E-24 19 11 18 475GTGGCAAA 1.14E-23 28 18 137 725

Vilo et.al. ISMB 2000

25.11.2009

16

YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1YDL020C YDL020C RPN4 protein degradation, ubiquitin26S proteasome subunit S0002178 1YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1YKL145W YKL145W RPT1 protein degradation, ubiquitin26S proteasome subunit S0001628 1YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1YBR173C YBR173C UMP1 t i d d ti bi iti 20S t t ti f t S0000377 1

GGTGGCAA - proteasome associated control element

YBR173C YBR173C UMP1 protein degradation, ubiquitin20S proteasome maturation factor S0000377 1YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1YGR232W YGR232W unknown unknown S0003464 1

>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)

TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_

...>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_

101 Sequences relative to ORF start

GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33AAAATTTT 1 63/77 2 833/911 R 4 95687 BP 5 02807 32

YGR128C + 100

AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29...

GATGAG.TTGAAA..TTT

Sequence patterns: the basis of the SPEXS

A G A AT C GC C C

GCAT (4 positions)

GCATA (3 positions)

GCATA.C

GCATA.

SPEXS ‐ substrings

• enqueue( Q , Empty pattern (occurs everywhere) )

• while P = deque( Q )

– check all positions of P.pos

F h i P– For every character c in P.pos

• create pattern Pc

• advance all positions in Pc.pos by 1

• enqueue( Q, Pc )


SPEXS: count and memorize

i...v....x....v....xabracadabradadabraca

aa{1,4,6,8,11,13,15,18,20}

{2,5,7,9,12,14,16,19,21}

SPEXS: extend …

a


a

c

{5,19}

b

{2,9,16}

d

{7,12,14}

{2,5,7,9,12,14,16,19,21}

25.11.2009

17

SPEXS: find frequent first

a


a

b

{2,9,16}

d

{7,12,14}

{2,5,7,9,12,14,16,19,21}

SPEXS: group positions

a


a

b

{2,9,16}

d

{7,12,14}

{2,5,7,9,12,14,16,19,21}

[bd]

{2,7,9,12,14,16}

.

The wildcards

GCAT.{3,6}X

The wildcards

GCAT.*X

The wildcards: not too many

aw:0

w:0

.{3.6}bw:1w:0

SPEXS: general algorithm

1. S = input sequences ( ||S||=n )2. e = empty pattern, e.pos = {1,...,n}3. enqueue( order , e )

4. while p = dequeue( order ) 5. generate all allowed extensions p’ of p (& p’.pos)g p p ( p p )6. enqueue( order, p’, priority(p’) ) 7. enqueue( output, p’, fitness(p’) )

8. while p = dequeue( output )9. Output p

Jaak Vilo: Discovering Frequent Patterns from Strings.Technical Report C-1998-9 (pp. 20) May 1998. Department of Computer Science, University of Helsinki.

Jaak Vilo: Pattern Discovery from Biosequences PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3 Helsinki, November 2002, 149 pages

Applications in bioinformatics:

-Gene regulation (1998: 255+ citations, 2000: 73 cit)

-Functional elements in proteins (2002: 32 cit)

25.11.2009

18

SPEXS ‐ Sequence Pattern EXhaustive SearchJaak Vilo, 1998, 2002

• User‐definable pattern language: substrings, character groups, wildcards, flexible wildcards (c.f. PROSITE)

• Fast exhaustive search over pattern language

• “Lazy suffix tree construction”‐like algorithm (Kurtz, Giegerich)

• Analyze multiple sets of sequences simultaneously

• Restrict search to most frequent patterns only (in each set)

• Reportmost frequent patterns, patterns over‐ or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution

30min

Multiple data setsD1 D2 D3

4/3 (6) 3/3 (12) 2/2 (9)

.G.GATGAG.T. 39 seq

.G.GATGAG.T. 39 seq (vs 193) p= 2.5e-33

-1: .G.GATGAG.T. 61 seq (vs 1292)

-1: .G.GATGAG.T. 61 seq (vs 1292) p= 1.4e-19

-2: .G.GATGAG.T. 91 seq

-2: .G.GATGAG.T. 91 seq (vs 5464)


25.11.2009

19

Jaak Vilo: Pattern Discovery from Biosequences PhD Thesis, Department of Computer Science, University of Helsinki, FinlandSeries of Publications A, Report A-2002-3 Helsinki, November 2002, 149 pages


These hits result in a PWM:

PWM based on all previous hits, here shown highest-scoring occurrences in blue

All against all

approximate matching

For every subsequence of every sequence

Match approximately against all the the sequences.

Approximate hits define PWM matrices (not all positions vary equally).

Look for ALL PWM-s derived from data that are enriched in data set (vs. background).

Hendrik Nigul, Jaak Vilo

Dynamic programming

• Small nr of edit operations allows to limit the search efficiently around main diagonal

Suffix Tree

A

C GT

G

G

T

{1:24,2:12,2:23…}

25.11.2009

20

Trie based all against all approximatematching

• trieindex

• trieagrep

• trieallagrep

• triematrix

Hendrik Nigul, Jaak Vilo

DataMining MTAT.03.183 (4AP 6EAP) Streams, time series · PDF filetechniques can be used to learn this prediction task from labeled ... • Built in D2K as D2K modules and leveraged

Documents