Machine Learning Introduction Presentation

Apr 03, 2018

Transcript
  • 7/28/2019 Machine Learning Introduction Presentation

    1/35

    Introduction to Machine Learning

    Research on Time Series

    Umaa Rebbapragada

    Tufts University

    Advisor: Carla Brodley

    1/29/07

    2/35

    Machine Learning (ML)

    Originally a subfield of AI

    Extraction of rules and patterns from datasets

    Focused on: Computational complexity

    Memory

    3/35

Machine Learning Tasks for Time Series

    Classification

Clustering

Semi-supervised learning

    Anomaly Detection

    4/35

    Assumptions

    Univariate time series

    Time series databases

    5/35

    Single Time Series

A single long time series can be converted into a set of smaller time series by sliding a window incrementally across it.

Window length is usually a user-specified parameter.
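As a minimal sketch of the windowing step (the names `sliding_windows`, `width`, and `step` are illustrative, not from the slides):

```python
def sliding_windows(series, width, step=1):
    """Convert one long time series into a set of shorter ones
    by sliding a window of length `width` across it."""
    return [series[i:i + width]
            for i in range(0, len(series) - width + 1, step)]

ts = [3, 1, 4, 1, 5, 9, 2, 6]
windows = sliding_windows(ts, width=4, step=2)
# step=2 yields windows starting at indices 0, 2, 4
```

With `step=1` the windows overlap maximally; larger steps trade coverage for fewer, less redundant subsequences.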

    6/35

Challenges of Time Series Data

    High dimensional

Voluminous

Requires fast techniques

    7/35

    Brute Force Similarity Search

Given query time series Q, the best match by sequential scanning is found in O(nd) time, where n is the number of series in the database and d is their length.

Finding the nearest neighbor for each time series in the database is prohibitive.
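A minimal sketch of the sequential scan (pure Python; the toy database is made up for illustration):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, database):
    """Sequential scan: compare the query against every series,
    O(nd) for n series of length d."""
    return min(database, key=lambda series: euclidean(query, series))

db = [[0, 0, 0], [1, 2, 3], [1, 2, 2]]
print(best_match([1, 2, 2.2], db))  # → [1, 2, 2]
```

Repeating this scan once per series to find every nearest neighbor costs O(n²d), which is the prohibitive case the slide refers to.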

    8/35

    Similarity Search

    Clustering and classification methods

    perform many similarity calculations

    Some require storage of the k nearest

    neighbors of each data instance

    Critical that these calculations be fast

    9/35

    Speeding up Similarity Search

Alternate time series representations

Search databases faster

    New similarity metrics

    10/35

    Data Mining Time Series Toolbox

    Indexing

Dimensionality Reduction

Segmentation

    Discretization

    Similarity metric

    11/35

    Indexing

Faster than a sequential scan

Insertions and deletions do not require rebuilding the entire index

Partition the data into regions

Search regions that contain a likely match

Requires a similarity metric that obeys the triangle inequality

    12/35

    Indexing

    R-trees

    kd-trees

    linear quad-trees

    grid-files

    13/35

Indexing on Time Series Data

High dimensionality slows down computation

The curse of dimensionality inhibits the efficiency of indexing

    14/35

    Dimensionality Reduction

    Reduces the size of the time series

Distance on the transformed data should lower-bound the original distance

This guarantees no false dismissals (false negatives)

    15/35

Dimensionality Reduction: DFT, DWT, SVD

Represent time series using subsets of:

Fourier coefficients

Wavelet coefficients

Eigenvalues/eigenvectors

Euclidean distance is lower-bounded under DFT [1], DWT [2], and SVD [3]

    [1] C. Faloutsos et al.: Fast Subsequence Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429

    [2] K. Chan and A. Fu: Efficient Time Series Matching by Wavelets. ICDE 1999: 126-133

    [3] F. Korn et al.: Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. SIGMOD Conference 1997: 289-300
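The lower-bounding property for the DFT case can be checked numerically. A sketch using the orthonormal (1/√n) scaling so that Parseval's theorem applies; the series p and q are made-up examples:

```python
import cmath
import math

def dft_coeffs(series, k):
    """First k DFT coefficients with orthonormal 1/sqrt(n) scaling,
    so the full transform preserves Euclidean distance (Parseval)
    and any truncation can only shrink it."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
            for f in range(k)]

def coeff_dist(a, b):
    """Euclidean distance between complex coefficient vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
q = [0.0, 1.0, 1.0, 2.0, 4.0, 4.0, 2.0, 1.0]
# Distance on 3 kept coefficients never exceeds the true distance
print(coeff_dist(dft_coeffs(p, 3), dft_coeffs(q, 3)) <= euclid(p, q))  # → True
```

Keeping all n coefficients reproduces the Euclidean distance exactly; dropping coefficients only removes nonnegative terms from the sum, which is why no false dismissals can occur.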

    16/35

    Gemini Framework

    Faloutsos et al., 1994

    Map each time series to a lower dimension

Store in a multi-dimensional indexing structure

    C. Faloutsos et al.: Fast Subsequence Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429

    17/35

Piecewise Aggregate Approximation (PAA)

    Eamonn J. Keogh, et al.: Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases.

    Knowl. Inf. Syst. 3(3): 263-286 (2001)

    Fig: Eamonn J. Keogh, et al.: HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. ICDM 2005: 226-233
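A minimal PAA sketch (assumes the number of segments divides the series length evenly; the names are illustrative):

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: replace each of
    n_segments equal-length frames with its mean value."""
    frame = len(series) // n_segments  # assumes even division
    return [sum(series[i * frame:(i + 1) * frame]) / frame
            for i in range(n_segments)]

print(paa([1, 3, 5, 7, 2, 4, 6, 8], 4))  # → [2.0, 6.0, 3.0, 7.0]
```

An 8-point series reduced to 4 segment means: dimensionality drops by the frame size while coarse shape is preserved.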

    18/35

    Segmentation

Represent the time series in smaller, less complex segments.

    Piecewise Linear Approximation (PLA)

    Minimum Bounding Rectangles (MBR)

    19/35

Piecewise Linear Approximation (PLA)

    20/35

Minimum-Bounding Rectangles (MBR)

    Fig: A. Anagnostopoulos et al: Global distance-based segmentation of trajectories. SIGKDD Conference 2006: 34-43

    21/35

    Discretization

Transforms a real-valued time series into a sequence of characters from a discrete alphabet

Dimensionality reduction is implicit

Allows use of string functions on time series

    22/35

    SAX

    Jessica Lin et al. A symbolic representation of time series, with implications for streaming algorithms. DMKD 2003: 2-11

    Fig: Eamonn J. Keogh, et al.: HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. ICDM 2005: 226-233
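A rough SAX sketch for a 4-letter alphabet (the breakpoints are the standard equi-probable Gaussian cuts; helper names and the even-division assumption are illustrative): z-normalize, reduce with PAA, then map each segment mean to a letter.

```python
import statistics

BREAKPOINTS = [-0.6745, 0.0, 0.6745]  # equi-probable N(0,1) cuts
ALPHABET = "abcd"

def sax(series, n_segments):
    """Discretize a real-valued series into a SAX word:
    z-normalize, PAA-reduce, then map means to letters."""
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series)
    z = [(x - mu) / sigma for x in series]
    frame = len(z) // n_segments  # assumes even division
    means = [sum(z[i * frame:(i + 1) * frame]) / frame
             for i in range(n_segments)]
    def symbol(m):
        # First breakpoint the mean falls below picks the letter
        for letter, cut in zip(ALPHABET, BREAKPOINTS):
            if m < cut:
                return letter
        return ALPHABET[-1]
    return "".join(symbol(m) for m in means)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], 2))  # → "ad"
```

A low half followed by a high half becomes "ad": the SAX word makes string operations (hashing, suffix structures, edit-style comparisons) applicable to the series.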

    23/35

Is Euclidean Distance the Best Metric?

Everything discussed so far used Euclidean distance (ED) as the similarity metric

Is it the best similarity metric for time series?

    24/35

    Drawbacks of Euclidean Distance

Requires two time series to have the same dimensionality

1-to-1 alignment of the time axis

    25/35

    Cross Correlation

Cross correlation with convolution can find the optimal phase shift to maximize similarity

    Fig: P. Protopapas et al.: Finding outlier light-curves in catalogs of periodic variable stars. Mon. Not. Roy. Astron. Soc. 369 (2006) 677-696

    26/35

    Cross Correlation

The optimal phase shift (to the left) of the solid line is 0.3

    Fig: P. Protopapas et al.: Finding outlier light-curves in catalogs of periodic variable stars. Mon. Not. Roy. Astron. Soc. 369 (2006) 677-696
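The slides compute cross correlation efficiently via convolution (FFT); a direct O(n²) circular version is enough to illustrate recovering a shift. The series below are made up and `best_phase_shift` is an illustrative name:

```python
def best_phase_shift(a, b):
    """Return the circular shift of b that maximizes its cross
    correlation with a (direct O(n^2) computation; in practice
    this is done faster via convolution/FFT)."""
    n = len(a)
    def corr(shift):
        return sum(a[i] * b[(i + shift) % n] for i in range(n))
    return max(range(n), key=corr)

a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 0, 0, 1, 2, 1]   # a delayed by 3 samples
print(best_phase_shift(a, b))  # → 3
```

Aligning at the recovered shift before computing a distance removes the phase sensitivity that plain Euclidean distance has.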

    27/35

    Warped Time Axis

    Dynamic Time Warping (DTW)

    DTW allows many-to-one alignment

Time series need not be the same size

    Fig: Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337

    D. J. Berndt, and J. Clifford: Finding Patterns in Time Series: A Dynamic Programming Approach.

    Advances in Knowledge Discovery and Data Mining 1996: 229-248

    28/35

    DTW Algorithm

    29/35

    DTW Algorithm

    Fig: Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337
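The standard dynamic-programming formulation of DTW can be sketched as follows (absolute difference as the point cost is an assumption; any local distance works):

```python
import math

def dtw(a, b):
    """Dynamic time warping distance: fill an (n+1) x (m+1) cost
    matrix where each cell extends the cheapest of the three
    neighboring alignments. Many-to-one alignment is allowed,
    so the two series need not be the same length."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # → 0.0  (the 2 aligns to both 2s)
```

The O(nm) table fill is the computational cost criticized on the next slides.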

    30/35

    Drawbacks of DTW

    Computationally expensive

Does not adhere to the triangle inequality => cannot be used for indexing

    31/35

    Making DTW Faster

    Global constraints:

Sakoe-Chiba Band

Itakura Parallelogram

    Y. Sakurai, et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337
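A sketch of the Sakoe-Chiba constraint (assuming the standard DP with absolute-difference cost): only cells within a band of half-width r around the diagonal are filled, cutting the work to roughly O(nr).

```python
import math

def dtw_band(a, b, r):
    """DTW restricted to a Sakoe-Chiba band: cells with |i - j| > r
    are never computed, reducing cost from O(n*m) to about O(n*r)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only columns inside the band around the diagonal
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A wide enough band gives the same answer as unconstrained DTW
print(dtw_band([1, 2, 3], [1, 2, 2, 3], r=2))  # → 0.0
```

Besides the speedup, the band keeps warping paths near the diagonal, which often also prevents pathological alignments.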

    32/35

Making DTW Faster

Y. Sakurai et al.: FTW: fast similarity search under the time warping distance. PODS 2005: 326-337

E. Keogh and C. Ratanamahatana: Exact indexing of dynamic time warping. Knowl. Inf. Syst. 7(3): 358-386 (2005)

Y. Zhu and D. Shasha: Warping Indexes with Envelope Transforms for Query by Humming. SIGMOD Conference 2003: 181-192

E. Keogh and M. Pazzani: Scaling up dynamic time warping for datamining applications. KDD 2000: 285-289

B.-K. Yi et al.: Efficient Retrieval of Similar Time Sequences Under Time Warping. ICDE 1998: 201-208

    33/35

    Other Areas of Research

    Anomaly Detection

    Change Point Detection

    34/35

    Thesis Research

    Anomaly detection methods

    fast

    preserve interesting features

    35/35

    Thank You