
Time Series Analysis via Barcodesissc.uj.ac.za/~vivien/BarcodePaper2015.pdf · Time Series Analysis via Barcodes A.B.C. Balbuenaa, ... Time series, Topological data analysis, Dynamical

Apr 25, 2018


Time Series Analysis via Barcodes

A.B.C. Balbuena^a, R.T. Melton^b, J.A. Nable^c, and M.V.V. Visaya^{d,∗}

^a Institute of Mathematics, University of the Philippines-Diliman
Email: [email protected]

^b Email: [email protected]

^c Department of Mathematics, Ateneo De Manila University, Philippines
Email: [email protected]

^d Department of Pure and Applied Mathematics, University of Johannesburg, South Africa
Email: [email protected]

Abstract

Topological data analysis is used to study one-dimensional time series from observed data. The time series

is a set of unorganized sample points (point cloud data) and is reconstructed in 2-dimensional space. The

underlying structure is approximated by constructing simplicial complexes from the reconstruction. We impose

topological structures on the constructed time series and compute the Betti numbers of the persistent homology.

These calculations give us qualitative information which we use to impose structure on a time series. We use

deterministic and random time series and present the differences in their topological structures.

Keywords: Persistent homology, Barcode, Time series, Topological data analysis, Dynamical systems, Metric entropy, Shapes analysis, Statistical modeling

2000 MSC: 37M10, 55N99,

1. Introduction

Empirical data are usually nonlinear, high-dimensional, and redundant, and are produced in vast quantities. Time series data produced by most experimental systems are often corrupted by noise. Different approaches have been employed to analyze time series data, including statistical, dynamical systems, and complex network methods, among others ([19], [21]). The analysis of such data requires weeding out the equivocal and redundant

components of high-dimensional vectors. The projection of this high-dimensional character to a lower dimensional

space is often not handled by the usual statistical tools. Similarly, the dynamical approach, which considers

the geometry of the reconstructed time series, has difficulties when the dynamics occurs in more than three

dimensions. Further, reconstruction is applicable only when the time series is long and relatively free of noise

∗Corresponding author

Preprint submitted to Journal of Computational and Graphical Statistics November 12, 2015


[12]. It is in this context that an invariant manifold is approximated by identifying each point of the reconstructed

time series with a cell complex [11].

The methods of algebraic topology provide powerful mathematical tools that allow us to deal with nonlinearity.

Homology calculations on a complex allow us to associate high-dimensional shapes to the time series. The n-th

Betti number of a topological space X, denoted by βn(X), detects the number of n-dimensional holes of X. For

the first three Betti numbers, β0(X) is the number of pieces X is composed of, β1(X) is the number of holes,

and β2(X) is the number of voids. If X ⊂ Rd, then βn(X) = 0 for all n ≥ d. For example, the spherical surface

has Betti numbers β0 = 1, β1 = 0, β2 = 1, and βn = 0 for n ≥ 3.

One aim of computational homology is to use algebraic-topological tools to come up with algorithms for computing topological invariants of a set, given only finitely many points from the set. This perspective is far from the standard statistical methods and could complement dynamical systems methods. The past decade has witnessed increased application of algebraic topology to data analysis, e.g. in the areas of image compression and segmentation, speech pattern analysis, neuroscience, effective coverage in sensor networks, and gene expression analysis ([4],[5],[7],[9],[22]). For example, at an initial stage in a series of medical diagnoses, when looking at a CT scan image, we would like to distinguish a lump from threads. The information sought is more qualitative, as the size (metric) of the lump is not as important as the presence of the lump at this stage of processing.

The main objective of this paper is to be able to arrive at a plausible topological definition of ‘degree of order’

among a set of time series data, particularly using the idea of persistence [7]. Computing (persistent) Betti

numbers allows us to deduce the ‘shape’ of the data, or to give shape to it. Moreover, comparing the Betti numbers of a set of reconstructed time series can give us an intuition for the relative orderliness of the time series. To support our proposed definition, we compare our calculations with those of metric entropy (or Kolmogorov-Sinai entropy), equivalent to Shannon’s information-theoretic entropy [8], a well-known measure of order/randomness in a system.

The ideas are fairly intuitive, as shown by the following example. If one wants to find an individual in a room full

of 200 people say, then one’s task will be much easier if the 200 persons are clustered in ten groups than when

they are uniformly scattered across the room. In particular, uncertainty decreases as the occupants organize

themselves according to some social criteria (e.g. age, occupation, familial relations), information usually known

to the person looking for the particular individual. As illustrated in Figure 1, the numbers of clusters for the sets in (a), (b), and (c) are 1, 6, and 10, respectively. Thus, we say that the set with Betti number β0 = 10 is more ordered than the sets with β0 = 6 and β0 = 1. When β0 has the same value for two sets of data, the degree of order is determined by β1. Similarly, for two sets of data with the same β0 and β1 values, the relative degree of order is determined by comparing their β2 values.

Let $\mathbb{N}_0$ be the set of nonnegative integers. Consider sequences $(x_n)_{n \in \mathbb{N}_0}$ and $(y_n)_{n \in \mathbb{N}_0}$. We recall the lexicographic order of sequences. Let $\ell = \min\{j : x_j \neq y_j\}$. We say that $(x_n) \prec (y_n)$ if $x_\ell < y_\ell$.


Figure 1: Data clustering reducing uncertainty in the location of a particular person in a roomful of people.

Let $X \subset \mathbb{R}^d$. Define the concatenation of the first $d$ Betti numbers of $X$ by

\[ [\beta^d(X)] = \beta_0(X)\,\beta_1(X) \cdots \beta_{d-1}(X). \]

We propose the following definition of degree of order between topological spaces $X$ and $Y$.

Definition 1. Let $X, Y \subset \mathbb{R}^d$. We say that $Y$ has a higher degree of order, or is more orderly than $X$, if

\[ [\beta^d(X)] \prec [\beta^d(Y)]. \]

Denote by $X$, $Y$, and $Z$ the sets ($\subset \mathbb{R}^2$) in Figure 1(a), 1(b), and 1(c), respectively. From Definition 1, we have the following:

\[ [\beta^2(X)] = (1)(1) \prec [\beta^2(Y)] = (6)(1) \prec [\beta^2(Z)] = (10)(2) \]

because $1 < 6 < 10$.
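The comparison in Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function name `more_orderly` is hypothetical, the Betti words are the ones quoted above for Figure 1, and both words are assumed to have the same length d. Python tuples already compare lexicographically, so no custom ordering is needed.

```python
def more_orderly(betti_x, betti_y):
    """Return True if betti_x precedes betti_y lexicographically,
    i.e. the second space has the higher degree of order (Definition 1).
    Both arguments are assumed to list beta_0, ..., beta_{d-1}."""
    return tuple(betti_x) < tuple(betti_y)

# Betti words quoted in the text for the sets in Figure 1(a), (b), (c).
X, Y, Z = (1, 1), (6, 1), (10, 2)

# Sorting by the lexicographic order ranks the sets from least to most orderly.
ranking = sorted([Z, X, Y])
```

Ties in β0 are broken by β1, as described in the Introduction: `more_orderly((2, 0), (2, 1))` holds because the first entries agree and 0 < 1.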

2. Time Series Reconstruction

Although the underlying state space is unknown for an experimental one-dimensional time series, an embedded phase space can be reconstructed as a substitute by the delay reconstruction technique. Consider a compact manifold $M$, an unknown dynamical system $f : M \to M$, and an observation function $g : M \to \mathbb{R}$, where $f$ and $g$ are continuous. For an integer $d \geq 2$, denote by $\Gamma^d_g$ the reconstruction map of the form

\[ \Gamma^d_g : M \to \mathbb{R}^d, \qquad x \mapsto (g(x), g(f(x)), \ldots, g(f^{d-1}(x))), \]

where $f^{d-1}$ is the composition of $f$ with itself $d-1$ times. Let $x_t = f^t(x_0)$ and consider the orbit of $x_0 \in M$ defined by $O(x_0) = \{x_t\}_{t \in \mathbb{N}_0}$. From the only quantity that is available, i.e., a sequence of real numbers $\{u_t : u_t = g(f(x_t)),\ u_t \in \mathbb{R}\}_{t \in \mathbb{N}_0}$, we construct $d$-dimensional vectors and obtain the reconstructed set

\[ A^d = \{(u_t, u_{t+1}, u_{t+2}, \ldots, u_{t+(d-1)})\}. \tag{1} \]

We call $A^d$ the point cloud data (PCD) in $\mathbb{R}^d$ from $\{u_t\}$. In general, any finite subset of points from a metric space is a PCD. For $m > 0$, denote the $m$-th lag of $A^d$ by

\[ A^d_m = \{(u_t, u_{t+m}, u_{t+2m}, \ldots, u_{t+(d-1)m})\}, \]


where m is the lag (or time-delay). The time-delay embedding procedure [20] reconstructs {ut} to obtain an

attractor, particularly a point cloud of lagged coordinate vectors, as in Adm. Given correct d, Takens’ theorem

[20] states that Adm has the same topological properties as the attractor in M .
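As a minimal sketch of how the lagged point cloud can be built from a scalar series, the following Python function stacks shifted copies of the series as columns. The function name `delay_embed` and the toy series are assumptions made for illustration; the paper does not specify an implementation.

```python
import numpy as np

def delay_embed(u, d, m=1):
    """Return the point cloud of lagged vectors
    (u_t, u_{t+m}, ..., u_{t+(d-1)m}) as an array with one row per t."""
    u = np.asarray(u, dtype=float)
    n_vectors = len(u) - (d - 1) * m
    if n_vectors <= 0:
        raise ValueError("time series too short for this d and m")
    # Column j holds the series shifted by j*m samples.
    return np.column_stack([u[j * m : j * m + n_vectors] for j in range(d)])

u = np.arange(6.0)                    # toy series 0, 1, ..., 5
A2 = delay_embed(u, d=2)              # rows (u_t, u_{t+1})
A3_lag2 = delay_embed(u, d=3, m=2)    # rows (u_t, u_{t+2}, u_{t+4})
```

With m = 1 this reproduces the set in (1); larger m gives the lagged set. A series of length N yields N − (d − 1)m vectors, which is why short series limit the usable embedding dimension.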

The selection of m and d in time-delay reconstruction is akin to adjusting the light source and zoom lens in

a simple microscope, allowing a clearer picture of the object we are interested in. A brute force search for

the correct combination of m and d for an attractor is computationally unappealing and crude. The approach

suggested in [18] is to fix m = 1 and estimate d. If the embedding dimension does not fall neatly into the range

3-5, another value for m is chosen. We take this lead in our analysis on time series data. We embed our time

series data into manifolds of successive dimensions, obtaining point cloud data from which filtered complexes are

constructed. Persistent homology calculations are then performed via Smith Normal Forms of the matrices of

the boundary operators (see below). Takens’ theorem is thus replaced by the methods of persistent homology.

The true topological properties of the point cloud are those that persist along the filtration.

Definition 2. Let $X$ and $Y$ be one-dimensional time series with respective reconstructed sets $A^d_X$ and $A^d_Y$ in $\mathbb{R}^d$, as in (1). We say that $Y$ has a higher degree of order, or is more orderly than $X$, if

\[ [\beta^2(A^2_X)][\beta^3(A^3_X)] \cdots \prec [\beta^2(A^2_Y)][\beta^3(A^3_Y)] \cdots. \]

3. Simplicial Complexes and Homology Groups

We begin by defining abstract simplicial complexes. Any ordered set [x0, x1, . . . , xn] determines an oriented

n-simplex, which we denote by σn. The elements xi of σn are called the vertices of σn, and n is its dimension. Any subset of σn with q + 1 elements (q < n) is called a q-face of σn. An (abstract) simplicial complex K is a finite collection of simplices that is closed under the formation of subsets and intersections, that is, it satisfies the following two conditions:

(i) Any face of a simplex in K is also in K.

(ii) The intersection of any two simplices in K is either empty or is a face of both simplices.

The dimension of a simplicial complex is the highest dimension of the simplices comprising it. If a subset K0 ⊆ K

is a simplicial complex, then it is a subcomplex of K. Simplicial complexes possess algebraic, topological, and

combinatorial properties that make them particularly convenient for modeling complex structures. For a good

introduction to homology theory, we refer the reader to [13].

In this paper, we use simplicial homology with coefficients in Z2 = {0, 1}. Given a simplicial complex K, an

n-chain is a linear combination of n-simplices in K, i.e.

\[ c = \sum_{\sigma} a_\sigma \sigma, \]

where aσ ∈ Z2, and σ is an n-simplex in K. By definition, the set of n-chains is in one-to-one correspondence

with the set of subsets of n-simplices since the field of coefficients is Z2. If we define the addition of chains as

the addition of these vectors (mod 2), then all the n-chains form an abelian group, denoted by Cn(K).
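The correspondence between n-chains over Z2 and subsets of n-simplices can be made concrete with a short Python sketch. Representing a chain as a `frozenset` of simplices is an assumption made here for illustration; addition mod 2 then becomes the symmetric difference of sets, since a simplex appearing twice cancels.

```python
def add_chains(c1, c2):
    """Add two Z2-chains, each given as a collection of simplices.
    A simplex that appears in both chains cancels (1 + 1 = 0 mod 2)."""
    return frozenset(c1) ^ frozenset(c2)

# Edges of a triangle, written as vertex tuples.
ab, bc, ac = ("a", "b"), ("b", "c"), ("a", "c")

path = add_chains({ab}, {bc})      # the 1-chain [a,b] + [b,c]
cycle = add_chains(path, {ac})     # all three edges
zero = add_chains(cycle, cycle)    # every chain is its own inverse
```

The empty set plays the role of the zero chain, and every chain is its own inverse, which is what makes the set of n-chains an abelian group under this operation.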


The collection of (n− 1)-dimensional faces of an n-simplex σ is called the boundary of σ, denoted by ∂n(σ). For

σ = [x0, . . . , xn], the boundary map ∂n : Cn(K)→ Cn−1(K) is given by

\[ \partial_n(\sigma) = \sum_{i=0}^{n} (-1)^i [x_0, \ldots, \hat{x}_i, \ldots, x_n], \tag{2} \]

where x̂i indicates that xi does not appear. The boundary of an n-chain is the sum of the boundaries of the

n-simplices in the chain. For n ∈ N0, the map ∂n connects the chain of groups into a chain complex:

\[ \emptyset \to C_n(K) \xrightarrow{\partial_n} C_{n-1}(K) \xrightarrow{\partial_{n-1}} C_{n-2}(K) \to \cdots \to C_0(K) \to \emptyset \]

where ∅ is the trivial group, and where ∂n−1∂n = 0 (the zero map) for all n. An n-cycle φ ∈ Cn(K) is an n-chain

satisfying ∂n(φ) = 0, meaning it has an empty topological boundary. An n-boundary is an n-chain ϕ ∈ Cn(K)

such that ϕ = ∂n+1(φ) for some φ ∈ Cn+1(K). Denote the collection of all n-cycles by Zn(K), and the collection

of all n-boundaries by Bn(K). By the linearity of the boundary operator, Zn(K) is a subgroup of Cn(K). Also,

since ∂n−1∂n = 0, Bn(K) is a subgroup of Zn(K). As Cn(K) is abelian, Bn(K) is normal in Zn(K). The n-th

homology group of K is defined by the quotient group

Hn(K) = Zn(K)/Bn(K).

The contrapositive of the following theorem [13] allows us to tell whether two topological spaces are not homeo-

morphic.

Theorem 1. If X and Y are homeomorphic topological spaces, then there is an isomorphism of homology groups

Hn(X) ∼= Hn(Y ).

Since the homology groups are finitely generated abelian groups, each of them decomposes as

\[ H_n(X) \cong \mathbb{Z} \times \cdots \times \mathbb{Z} \times \mathbb{Z}_{m_1} \times \cdots \times \mathbb{Z}_{m_k}. \]

The number of times that $\mathbb{Z}$ appears in $H_n(X)$ is called the n-th Betti number of X. The Betti numbers are computed by means of the Smith normal form of matrices.

4. The Smith Normal Form

We consider the Smith normal form for a matrix representation of the boundary operator ∂n : Cn(K)→ Cn−1(K).

Because ∂n is a linear mapping and the set of ordered n-simplices forms a basis of Cn(K), it is possible to write ∂n as a matrix [∂n] with entries from the set {0, 1}. Consider the oriented triangle [a, b, c] and the ordered bases {[a, b], [b, c], [a, c]} and {a, b, c}. Using (2) we have ∂2([a, b, c]) = [b, c] − [a, c] + [a, b], ∂1([a, b]) = b − a, ∂1([b, c]) = c − b, and ∂1([a, c]) = c − a. Since the coefficients are in Z2, the signs can be dropped, and the matrix representations of ∂2 and ∂1 are

\[ [\partial_2] = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \quad\text{and}\quad [\partial_1] = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \]


respectively. An integer matrix [∂n] is reduced to its Smith normal form [∂̃n] by a modified Gaussian elimination in which the entries remain integers at each stage. The reduced matrix has the form

\[ [\tilde{\partial}_n] = \begin{pmatrix} R_k & 0 \\ 0 & 0 \end{pmatrix}, \]

where $R_k$ is a diagonal matrix $\mathrm{diag}(r_1, \ldots, r_k)$ with the property that each $r_i$ divides $r_{i+1}$.

The Smith normal forms of the matrices of ∂n+1 and ∂n determine Hn completely. The torsion coefficients of Hn are the diagonal entries of [∂̃n+1] that are greater than 1, while the rank of Zn is the number of zero columns of [∂̃n]. The rank of the boundary group Bn is the number of nonzero rows of [∂̃n+1]. Via the rank-nullity theorem [REFERENCE!], the n-th Betti number is given by

\[ \beta_n = \mathrm{rank}(Z_n) - \mathrm{rank}(B_n). \]
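The formula can be checked on the triangle example with a short Python sketch. Note the simplification, which is an assumption of this sketch rather than the method described above: because the coefficients are in Z2, there is no torsion, so plain Gaussian elimination mod 2 gives the ranks and a full Smith normal form is not needed. The helper names `rank_mod2` and `betti` are hypothetical.

```python
import numpy as np

def rank_mod2(matrix):
    """Rank of a 0/1 matrix over Z2, by Gaussian elimination mod 2."""
    m = np.array(matrix, dtype=np.uint8) % 2
    if m.size == 0:
        return 0                      # a map from or to the zero space
    rows, cols = m.shape
    rank = 0
    for col in range(cols):
        pivots = np.nonzero(m[rank:, col])[0]
        if pivots.size == 0:
            continue
        pivot = rank + pivots[0]
        m[[rank, pivot]] = m[[pivot, rank]]   # move the pivot row up
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]               # row addition mod 2
        rank += 1
        if rank == rows:
            break
    return rank

def betti(dim_Cn, d_n, d_n_plus_1):
    """beta_n = rank(Z_n) - rank(B_n), where rank(Z_n) = dim C_n - rank(d_n)
    by rank-nullity and rank(B_n) = rank(d_{n+1})."""
    return dim_Cn - rank_mod2(d_n) - rank_mod2(d_n_plus_1)

d0 = np.zeros((0, 3))                          # boundary of a vertex is zero
d1 = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]         # [d_1] from the text
d2 = [[1], [1], [1]]                           # [d_2] from the text
no_triangles = np.zeros((3, 0))                # hollow triangle: no 2-simplex

hollow_b1 = betti(3, d1, no_triangles)         # one 1-dimensional hole
filled_b1 = betti(3, d1, d2)                   # the 2-simplex fills the hole
b0 = betti(3, d0, d1)                          # one connected component
```

The hollow triangle has one loop, and adding the 2-simplex [a, b, c] turns that loop into a boundary, so β1 drops from 1 to 0 while β0 stays 1.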

5. Persistent Homology

A PCD is first turned into a space with structure, namely a simplicial complex (here, a witness complex). Witness complexes produce a nested family of complexes, which allows us to compute persistent Betti numbers.

Consider an annulus with outer radius equal to the radius of a given circle (regarded, together with its interior, as a disk). The first homology group distinguishes

these objects by their number of holes. Suppose we only have a PCD sampled from each object, possibly with

noise. Depending on the density and accuracy of the sample, and on the relative sizes of the inner and outer

radii of the annulus, these PCDs might look very similar, or quite distinct. Identifying which PCD came from

the circle, and which PCD came from the annulus is called manifold learning [2].

In general, PCDs are finite, usually large, sets of points that do not have any interesting topology. One method

of addressing this problem is to replace each point with a small ball of radius ε. The result depends strongly on

the choice of the parameter. If ε were small, say one percent of the radius of the circle, then the union of the

ε balls would look like a disconnected set of discrete points. On the other hand, the union of the ε balls would

look like a large connected blob if ε were large, say ninety-nine percent of the circle’s radius. In both cases, we could not tell which data set came from the annulus and which from the circle. Ideally, we want ε such that the union of ε-balls for the annulus contains a hole, while the union for the circle contains none. Instead of finding the correct value of ε,

persistent homology considers a range of ε-values. Topological features that persist for a wide range of ε-values

are considered to be actual features of the data set, while those that have short lives are considered topological

noise [3].
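The dependence on ε is easy to see for β0. The following Python sketch, with a hypothetical function `beta0` and a made-up five-point sample, counts the connected components of the union of closed ε-balls using a union-find structure. It is an illustration of the idea only, not the witness-complex computation used later in the paper.

```python
import numpy as np

def beta0(points, eps):
    """Number of connected components of the union of closed eps-balls
    centred at the given points."""
    pts = np.asarray(points, dtype=float)
    parent = list(range(len(pts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            # Two closed eps-balls meet exactly when their centres
            # are at most 2*eps apart.
            if np.linalg.norm(pts[i] - pts[j]) <= 2 * eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(pts))})

# Two visible clusters on a line: {0, 0.1, 0.2} and {1.0, 1.1}.
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0)]
counts = [beta0(pts, eps) for eps in (0.01, 0.06, 0.5)]
```

For this sample the count goes from 5 (every point isolated) to 2 (the two clusters) to 1 (a single blob) as ε grows. The value 2 is the one that survives over the widest range of ε, which is the sense in which it is the persistent feature.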

Definition 3. A finite simplicial complex K is filtered if K is the union of an increasing sequence of subcomplexes,

i.e.

\[ K_0 \subseteq K_1 \subseteq \cdots, \qquad K = \bigcup_{l} K_l. \]


Algebraically, persistence is how long a homology class persists along a filtration of a topological space. In the

annulus and circle example above, the topological space is approximated by the simplicial complexes in the

filtration, where the complexes are built up from the PCD sampled from the topological spaces. The filtration

is built up by gradually increasing the values of ε. The inclusions along the filtration induce corresponding

inclusions at the level of chains, cycles, and boundaries. That is, the inclusions Kl ⊆ Kl+1 induce the inclusions

of n-chains Cn(Kl) ⊆ Cn(Kl+1), of n-cycles Zn(Kl) ⊆ Zn(Kl+1), and of n-boundaries Bn(Kl) ⊆ Bn(Kl+1). In

particular, the following diagram commutes:

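\[
\begin{array}{ccccccc}
\cdots \to & C_n(K_{l-1}) & \to & C_n(K_l) & \to & C_n(K_{l+1}) & \to \cdots \\
 & \cup & & \cup & & \cup & \\
\cdots \to & Z_n(K_{l-1}) & \to & Z_n(K_l) & \to & Z_n(K_{l+1}) & \to \cdots \\
 & \cup & & \cup & & \cup & \\
\cdots \to & B_n(K_{l-1}) & \to & B_n(K_l) & \to & B_n(K_{l+1}) & \to \cdots
\end{array}
\]

The normality of the boundary groups in their respective cycle groups induces homomorphisms at the level of homology, that is,

\[ \cdots \to H_n(K_{l-1}) \to H_n(K_l) \to H_n(K_{l+1}) \to \cdots \tag{3} \]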

Cycles in Zn(Kl) which are not in the subgroup Bn(Kl+p) ∩ Zn(Kl), i.e. the nonbounding cycles in Zn(Kl)

which are also nonbounding in Zn(Kl+p), correspond to the nontrivial cosets of the following quotient group:

\[ H^p_n(K_l) = \frac{Z_n(K_l)}{B_n(K_{l+p}) \cap Z_n(K_l)}. \]

This quotient group is the p-persistent n-th homology group of Kl.

The p-persistent n-th homology group of a filtered simplicial complex essentially counts the nonbounding cycles in a subcomplex that remain nonbounding for a given interval in the filtration. Recall that it is the dimension of the homology groups that gives the Betti numbers of the complex.

6. Landmark Sets and JPlex

Consider a PCD Ad ⊂ Rd from a time series, as in (1). We give a topology to Ad by considering a simplicial

complex approximation to it, called the witness complex. A smaller subset of Ad is used in constructing a witness

complex. This set is called a landmark set and will be denoted by L.

Let $|V|$ be the cardinality of a set $V$. Given a PCD, we choose the number of landmark points to be $|L| \geq 0.05\,|\mathrm{PCD}|$, as suggested in [6]. In choosing $L$, we use the maxmin algorithm [16]. The first point $\ell_0 \in A^d$ is randomly selected and the next points are selected inductively. Suppose the set of the first $i-1$ landmark points $L_{i-1} = \{\ell_0, \ell_1, \ldots, \ell_{i-2}\}$ has been selected. Choose $\ell_{i-1} \in A^d \setminus L_{i-1}$ which maximizes the function

\[ x \mapsto \rho(x, L_{i-1}), \]

where $\rho(x, L_{i-1})$ is the distance between $x$ and $L_{i-1}$.
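The maxmin selection can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the function name `maxmin_landmarks` is hypothetical, distances are Euclidean, and the first landmark is passed in as an index (`first`) instead of being drawn at random, so that the result is reproducible.

```python
import numpy as np

def maxmin_landmarks(points, n_landmarks, first=0):
    """Greedy maxmin selection: repeatedly add the point that is
    farthest from the landmarks chosen so far. Returns point indices."""
    pts = np.asarray(points, dtype=float)
    chosen = [first]
    # dist_to_L[j] = rho(x_j, L): distance from point j to its nearest landmark.
    dist_to_L = np.linalg.norm(pts - pts[first], axis=1)
    while len(chosen) < n_landmarks:
        nxt = int(np.argmax(dist_to_L))       # maximizer of x -> rho(x, L)
        chosen.append(nxt)
        new_dist = np.linalg.norm(pts - pts[nxt], axis=1)
        dist_to_L = np.minimum(dist_to_L, new_dist)
    return chosen

pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.5, 0.0], [0.9, 0.0]])
landmarks = maxmin_landmarks(pts, 3, first=0)
```

Starting from the point at 0, the farthest point is the one at 1.0, and the next landmark is the midpoint 0.5, which is farthest from both. The selection therefore spreads the landmarks over the cloud instead of concentrating them in dense regions.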


Figure 2: (a) The PCD (⊂ R3) reconstructed from a human-generated time series. (b) The much sparser set of 40 landmark points of the PCD in (a).

For example, in the PCD illustrated in Figure 2(a), not all points will be considered in the construction of a

witness complex. Rather, only a set of landmark points as in Figure 2(b) is chosen. Once L is chosen, a witness

complex is constructed. We give a definition following [1].

Definition 4. Let $X \subset \mathbb{R}^d$ be a PCD, and let $L \subset X$ be a landmark set. For $\varepsilon > 0$, the witness complex $W(X, L, \varepsilon)$ is such that

(i) the vertex set is $L$,

(ii) for $n > 0$ and $\ell_i \in L$, the $n$-simplex $[\ell_0, \ell_1, \ldots, \ell_n]$ is in $W(X, L, \varepsilon)$ if all of its faces are in $W(X, L, \varepsilon)$, and

(iii) there exists a point $x \in X$ such that

\[ \max\{\rho(x, \ell_0), \rho(x, \ell_1), \ldots, \rho(x, \ell_n)\} \leq \varepsilon + m_n(x), \]

where $m_n(x)$ is the distance of $x \in X$ to its $(n+1)$-th nearest landmark point. A point $x$ in (iii) is called a witness to the existence of the simplex.

Given a point cloud $Z$ and a landmark subset $L$, we define $R = \max_{z \in Z} d(z, L)$. The number $R$ reflects how finely the landmarks cover the data set. We often use it as a guide for selecting the maximum filtration value $t_{\max}$ for a witness or lazy witness stream.
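The quantity R is a one-line computation once the distances from every point to every landmark are available. The sketch below assumes Euclidean distance and uses a hypothetical function name, `covering_radius`; landmarks are given as indices into the point cloud.

```python
import numpy as np

def covering_radius(points, landmark_idx):
    """R = max over points z of the distance from z to the landmark set."""
    pts = np.asarray(points, dtype=float)
    lm = pts[list(landmark_idx)]
    # dists[i, j] = distance from point i to landmark j.
    dists = np.linalg.norm(pts[:, None, :] - lm[None, :, :], axis=2)
    return float(dists.min(axis=1).max())

pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.5, 0.0], [0.9, 0.0]])
R = covering_radius(pts, [0, 2])           # landmarks at 0.0 and 1.0
R_all = covering_radius(pts, range(5))     # every point is a landmark
```

Here the point at 0.5 is the worst covered, so R is 0.5. Adding it as a third landmark would reduce R to about 0.1, which illustrates why a smaller R indicates a finer cover.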

To compute the Betti numbers of the witness complex associated to L, we use the barcodes generated by the

software package JPlex [16]. A barcode is a graph comprised of a collection of horizontal intervals, as illustrated

in several of the figures to follow. Each interval in the barcode corresponds to a non-trivial homology class in

one of the persistent homology groups. The length of an interval in the barcode is directly proportional to the

number of consecutive complexes in the filtration for which the corresponding homology class remains nontrivial.

For 0 < ε′ ≤ ε, the software JPlex computes the witness complexes W (X,L, ε′) all at once, tracks the persistence

of the n-dimensional holes, and encodes this persistence in the form of a barcode. The short intervals in a

barcode are interpreted as artifacts of noise, while the long intervals correspond to real topological features of


the underlying structure. The persistent Betti numbers are then read off of the barcode as the number of long

intervals [7].

Figure 3: Barcodes from 50 landmark points taken from a noisy torus. Panels (a), (b), and (c) correspond to the persistent Betti numbers β0 = 1, β1 = 2, and β2 = 1, respectively.

As an illustration, consider 1,000 points from a noisy torus. Taking 50 landmark points, the Betti numbers

associated to the barcodes illustrated in (a), (b), and (c) of Figure 3 are β0 = 1, β1 = 2, and β2 = 1, respectively.

For n > 2, βn = 0. These values agree with the Betti numbers of a torus. Observe how the persistent Betti

numbers can change if holes are either filled in or created as simplices are added to the approximating simplicial

complex.

7. Metric Entropy

The metric (Kolmogorov-Sinai) entropy of a measure-preserving dynamical system $(X, \mathcal{B}, \mu, T)$ measures the randomness of its orbit structure. Consider a finite partition $\alpha$ of the phase space $X$ into $k$ pairwise-disjoint bins $A_1, \ldots, A_k$. For two partitions $\alpha = \{A_1, \ldots, A_k\}$ and $\beta = \{B_1, \ldots, B_m\}$, their least common refinement, denoted by $\alpha \vee \beta$, is given by $\{A_i \cap B_j \mid 1 \leq i \leq k,\ 1 \leq j \leq m\}$. For $n \in \mathbb{N}_0$, applying $T^{-n}$ to $\alpha$ produces a partition $\{T^{-n}A_1, \ldots, T^{-n}A_k\}$. One can interpret the finite partition of $X$ as the outcomes of an experiment. If applying $T^{-n}$ is thought of as the passage of time, then $\bigvee_{i=0}^{n-1} T^{-i}\alpha$ is interpreted as performing the experiment $\alpha$ on $n$ consecutive time periods. The entropy of a finite partition $\alpha$ of the system $(X, \mathcal{B}, \mu, T)$ is given by

\[ I(\alpha) = -\sum_{i=1}^{k} \mu(A_i) \log_b(\mu(A_i)). \]
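As a small numerical illustration of the partition entropy, the following Python sketch evaluates the sum for a few made-up bin measures. The function name `partition_entropy` is hypothetical, base b = 2 is chosen so that the result is in bits, and empty bins are skipped under the usual convention 0 log 0 = 0.

```python
import math

def partition_entropy(measures, base=2):
    """Entropy -sum mu(A_i) log_b mu(A_i) of a partition whose bins
    have the given measures (which must sum to 1)."""
    if abs(sum(measures) - 1.0) > 1e-9:
        raise ValueError("bin measures must sum to 1")
    # Bins of measure zero contribute nothing (convention 0 log 0 = 0).
    return -sum(p * math.log(p, base) for p in measures if p > 0)

uniform = partition_entropy([0.25] * 4)                 # four equally likely bins
skewed = partition_entropy([0.97, 0.01, 0.01, 0.01])    # one dominant bin
certain = partition_entropy([1.0, 0.0])                 # outcome known in advance
```

Four equally likely bins give 2 bits, the largest possible value for four bins. Concentrating the measure in one bin lowers the entropy, and a partition with a single bin of full measure has entropy zero.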

Given an arbitrary point x ∈ X, the entropy of a partition measures the uncertainty in which bin x will belong

to. We then get the entropy of the transformation T with respect to a given partition α. It is given by

\[ I(T, \alpha) = \lim_{n \to \infty} \frac{1}{n}\, I\Big(\bigvee_{i=0}^{n-1} T^{-i}\alpha\Big). \]


This value gives the average information per time period when performing the experiment forever. Finally, the

metric entropy of the system is given by

\[ I(T) = \sup_{\alpha} I(T, \alpha), \]

where the supremum is taken over all finite partitions α in B. The metric entropy is a measure of the maximum

information gained per time period by performing a finite experiment.

Theorem 2. Let $(X, \mathcal{B}, \mu, T)$ be a measure-preserving dynamical system. Given a sequence $\{\alpha_i\}$ of increasingly finer partitions such that $\bigvee_{i=1}^{\infty} \alpha_i = \mathcal{B}$,

\[ I(T) = \lim_{n \to \infty} I(T, \alpha_n). \]

When interpreted as performing an experiment, entropy is uncertainty in an experiment or, equivalently, the

information learned from an outcome of the experiment. Systems with higher entropy are systems with more ‘disorder’, that is, they are more random [8]. Properties of KS-entropy arise from the information-theoretic roots of entropy as defined by Shannon. Kolmogorov defined entropy for measure-preserving systems, a notion that was later formalized

by Sinai[17]. The following gives the definitions and result that guarantee the existence of metric entropy.

We use the algorithm of K. Short [17] to compute the metric entropy of time series data.

1. Normalize the time series into values between 0 and 1.

2. Choose a partition $\alpha = \{A_0, \ldots, A_{k-1}\}$ of $[0, 1]$, and call each element $A_j \in \alpha$ a bin.

3. Let $N$ be the length of the time series. If $x_i \in A_{j_i}$ ($j_i \in \{0, 1, \ldots, k-1\}$), express the time series $x_0, x_1, \ldots, x_n, \ldots, x_{N-1}$ as the string (of bins) $A_{j_0}, A_{j_1}, \ldots, A_{j_n}, \ldots, A_{j_{N-1}}$ (the main sequence).

4. Consider $a_0, a_1, \ldots, a_{n-1}$ ($n \leq N$), a sequence of bins of $\alpha$ (not necessarily distinct). Set $n = 1$, the length of the substring.

5. Compute

\[ p(a_0, a_1, \ldots, a_{n-1}) = \frac{\text{number of times the string } a_0, a_1, \ldots, a_{n-1} \text{ appears in the main sequence}}{\text{number of strings of length } n \text{ in the main sequence}}, \]

the observed probability that the string $a_0, a_1, \ldots, a_{n-1}$ appears in the time series.

6. Compute

\[ I_n(\alpha) = I\Big(\bigvee_{i=0}^{n-1} T^{-i}\alpha\Big) = -\sum_{a_0, a_1, \ldots, a_{n-1}} p(a_0, a_1, \ldots, a_{n-1}) \log_b\big(p(a_0, a_1, \ldots, a_{n-1})\big). \]

7. Compute the separation level, defined by

\[ \mathrm{sep} = \frac{\text{number of distinct substrings of length } n \text{ in the main sequence}}{\text{number of strings of length } n \text{ in the main sequence}}. \]

8. If sep ≥ 0.2, stop. Otherwise, increase $n$ by 1 and repeat Steps (5)-(8).

9. Plot $I_n$ versus $n$. The slope of the best-fit line estimates $I(T, \alpha) = \lim_{n \to \infty} I_n(\alpha)/n$.

10. Choose a finer partition, i.e. more bins, and repeat Steps (3)-(9).

11. The entropy $I(T)$ is the maximum of the computed values $I(T, \alpha)$.
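Steps 1 to 9 can be sketched in Python for a single partition. This is an illustrative reading of the steps, not the authors' code: the names `block_entropy` and `entropy_rate` are hypothetical, the partition is assumed to consist of k equal-width bins, logarithms are base 2, and a cap `max_n` is added (not part of the listed steps) so that the loop ends for very regular series whose separation level never reaches the threshold.

```python
from collections import Counter
import numpy as np

def block_entropy(symbols, n, base=2):
    """Return (I_n, sep) for all length-n substrings of the bin string."""
    windows = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    probs = np.array(list(counts.values()), dtype=float) / total   # Step 5
    entropy = float(-(probs * np.log(probs)).sum() / np.log(base)) # Step 6
    return entropy, len(counts) / total                            # Step 7

def entropy_rate(series, k, sep_threshold=0.2, max_n=10):
    """Estimate I(T, alpha) for k equal-width bins as the slope of I_n vs n."""
    x = np.asarray(series, dtype=float)
    x = (x - x.min()) / (x.max() - x.min())    # Step 1; assumes non-constant data
    # Steps 2-3: bin index of each sample (the value 1.0 goes to the last bin).
    symbols = np.minimum((x * k).astype(int), k - 1).tolist()
    ns, entropies = [], []
    for n in range(1, max_n + 1):              # Steps 4-8
        i_n, sep = block_entropy(symbols, n)
        ns.append(n)
        entropies.append(i_n)
        if sep >= sep_threshold:
            break
    if len(ns) < 2:
        return entropies[0]                    # too few points for a line fit
    slope, _intercept = np.polyfit(ns, entropies, 1)   # Step 9
    return float(slope)

periodic = [0.0, 1.0] * 500                    # perfectly regular series
rate_periodic = entropy_rate(periodic, k=2)
rng = np.random.default_rng(0)
rate_random = entropy_rate(rng.random(1000), k=2)
```

For the alternating series the block entropy stays near 1 bit for every n, so the fitted slope is close to zero. For the pseudo-random series the block entropy grows by roughly one bit per symbol, so the slope is close to 1. Steps 10 and 11 would repeat the call for increasing k and take the maximum of the resulting estimates.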

Clearly, the algorithm can be used even when the system is not known explicitly and only a finite trajectory of the system is available (Kahrizsangi). The metric entropy is determined from the embedded time series data by finding points on the trajectory that are close together in phase space but which occurred at different times (i.e. are not time correlated). Metric entropy reflects how well the behavior of one part of the trajectory can be predicted from another. Higher entropy means that the system is less predictable and is a step closer to stochasticity [H. Kantz et al., 1997].

8. Results

We analyze three sets of normalized data reconstructed in the unit square in Rd. The reconstructed sets will be

the point clouds of interest. From the n-th Betti numbers that correspond to barcodes calculated from the PCD

of a time series, we may be able to describe the shape of the time series via the number of n-dimensional holes.

This will allow us to compare the time series with one another. Persistence is applied to the PCD of each data set for d = 2 and d = 3. We also compute the metric entropy of the time series of each data set.

8.1. Calendar Data

From a calendar year, we consider a sequence of numbers constructed by taking all numbers that fall on a

Monday, followed by all numbers that fall on a Tuesday, and so on. Figure 4(a) illustrates the PCD from the

calendar data for d = 2, with a landmark set in Figure 4(b). Figure 4(c) is the PCD for d = 3. Figure 7(a)

shows the corresponding barcode for β0.

8.2. Magnetoelastic Ribbon Data

The magnetoelastic ribbon is a thin strip of magnetic material whose shape can be changed by applying a

magnetic field to it. The parameters are the strength of the applied uniform field Hdc and the strength and

frequency f of the applied oscillating field Hac in the vertical direction. With parameters Hdc=2212.45mV,

Hac=3200mV, and f=0.95 Hz, the data is the position of the ribbon once per driving period and consists of 1,000

points from voltage readings on a photonic sensor [15]. The PCDs for d = 2 and d = 3 of the (magnetoelastic)

ribbon data are shown in Figure 5. For d = 2, we have β0 = 2, as shown in the associated barcode in Figure

7(b).


Figure 4: (a) Consecutive pairs of numbers taken from the calendar data and normalized to fit within the unit square. (b) A landmark set of the PCD in (a). (c) The PCD from the calendar data for d = 3.

Figure 5: The PCD from the ribbon data for (a) d = 2 and (b) d = 3.

8.3. Generated Data

It has been shown in [10] that humans, individually, cannot generate random numbers. However, alternately stitching the numbers generated by at least seven individuals (i.e., collecting all first numbers generated, followed by all second numbers generated, etc.) produces a random sequence. Our data set consists of 1,000 points: 100 random numbers taken from each of ten individuals. For d = 2, any landmark set from the PCD of the generated data gives β0 = 1, as shown in Figure 7(c). However, the value of β1 is either 1 or 2, with associated barcodes illustrated in Figure 6.
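The stitching step described above amounts to interleaving the individual sequences. A minimal sketch follows, assuming every individual contributes the same number of values. The function name `stitch` and the placeholder input are illustrative only and are not the data collected in [10] or used here.

```python
import numpy as np

def stitch(sequences):
    """Interleave equal-length sequences: all first entries, then all second entries, etc."""
    arr = np.asarray(sequences)
    if arr.ndim != 2:
        raise ValueError("expected equal-length sequences, one per individual")
    # Rows are individuals; reading the transpose row by row gives the stitched order.
    return arr.T.reshape(-1)

# Placeholder input only: three hypothetical individuals, four numbers each.
demo = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
stitched = stitch(demo)
```

With ten individuals and 100 numbers each, the same call would return a sequence of 1,000 points.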

Figure 6: Several barcodes associated to β1 of the PCD (⊂ R^2) from the human-generated data set.


8.4. Summary of the βn and entropy values

A summary of the Betti numbers of the PCDs for the three data sets for d = 2 and d = 3 is given in Table 1. A comparison shows that the results for d = 3 are not simply trivial extensions of the results for d = 2. As the parameter ε increases, the three connected components of the calendar data, for both d = 2 and d = 3, persist until the specified threshold. By Definition 1, we say that the calendar data is the most orderly of the three data sets, followed by the ribbon data.

                  d = 2             d = 3
PCD          β0    β1          β0    β1    β2         metric entropy
Calendar     3     2           3     2     0          0.57
Ribbon       2     0           1     0     0          1.21
Generated    1     1 or 2      1     0     1 or 2     3.23

Table 1: Persistent Betti numbers of the PCD from each of the three examples.
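To illustrate how a β0 value depends on the scale ε, the sketch below counts the connected components of the graph that joins any two points at distance at most ε. This is a Vietoris-Rips-style simplification swapped in for illustration. It is not the landmark-based construction used for Table 1, and its scale convention is an assumption. The toy point cloud is synthetic.

```python
import numpy as np

def betti0(points, eps):
    """Number of connected components of the graph joining points at distance <= eps."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise Euclidean distances (adequate for a few hundred points).
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= eps:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(n)})

# Synthetic point cloud: three tight clusters in the unit square.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
cloud = np.vstack([c + 0.02 * rng.random((20, 2)) for c in centers])
counts = [betti0(cloud, eps) for eps in (0.0, 0.05, 0.3, 1.5)]
```

In this toy example the count of three components is stable over a wide range of ε before the clusters merge, which is the kind of persistence a long bar in a β0 barcode records.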

Figure 7: The barcode associated to β0 of the PCD from the (a) calendar data, (b) ribbon data, and (c) generated data for d = 2.

Figure 8 plots the string length n against the metric entropy I_n, computed for a fixed partition α (each partition is denoted by a distinct color). The slope of the best-fit line estimates $I(T, \alpha) = \lim_{n \to \infty} I_n(\alpha)/n$.
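The slope estimate can be sketched as follows, under stated assumptions: I_n(α) is taken to be the Shannon entropy (in nats) of the empirical distribution of length-n symbol strings, and α is a uniform partition of the data range into `bins` cells. Both choices, and the function names, are illustrative and may differ from the computation behind Figure 8.

```python
import numpy as np
from collections import Counter

def block_entropy(symbols, n):
    """Shannon entropy (in nats) of the empirical distribution of length-n words."""
    words = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    p = np.array(list(words.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def entropy_rate(series, bins=4, max_n=5):
    """Slope of the least-squares line through (n, I_n) for n = 1..max_n."""
    x = np.asarray(series, dtype=float)
    # Uniform partition of [min, max] into `bins` cells; symbols are cell indices.
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]
    symbols = np.digitize(x, edges).tolist()
    ns = np.arange(1, max_n + 1)
    I_n = [block_entropy(symbols, int(n)) for n in ns]
    slope, _intercept = np.polyfit(ns, I_n, 1)
    return slope

rng = np.random.default_rng(0)
periodic = np.tile([0.1, 0.9], 500)  # period-2 series: I_n stays close to log 2
noisy = rng.random(1000)             # i.i.d. uniform samples
rate_periodic = entropy_rate(periodic)
rate_noisy = entropy_rate(noisy)
```

A periodic series yields a slope near zero, while an i.i.d. series yields a clearly positive slope. For short series the estimate of I_n is biased downward at large n, so `max_n` should stay small relative to the series length.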

Figure 8: String length n versus metric entropy I_n for (a) calendar data, (b) ribbon data, and (c) generated data.


9. Conclusions

This paper has presented an approach to time series analysis that uses an algebraic topological classification to compare the orderliness of time series. The classification is readily computable. In particular, Betti numbers are used to give an ordering among data sets. The higher β0 value detected in the calendar data suggests that it is more orderly than the other two data sets. This result is expected, as the numbers in the calendar data indeed possess an orderly character (e.g., 1 is always followed by 8 and 2 by 9). Also from the summary, we say that the ribbon data is more orderly than the human-generated data. The metric entropy of the data sets corroborates the topological calculations. Thus, as a complement to the usual measures of order and randomness, we are able to propose another that is readily computed. As such measures are seldom used for time series, our definition should justify associating measures of order and randomness to time series. In fact, persistent homology does more. As we have demonstrated, persistent, hence correct, topological structures can be associated to time series.

A strength of the methods presented here is their relevance to the study of qualitative aspects of time series data: not only do they say something about the topology of the spaces from which the data are sampled, but, in the absence of any apparent structure, they also impose a topology on the data set.

References

[1] H. Adams, JPlex with Matlab Tutorial,

URL http://comptop.stanford.edu/u/programs/jplex/files/PlexMatlabTutorial.pdf.

[2] R. Adler, O. Bobrowski, et al., Persistent Homology for Random Fields and Complexes, Borrowing Strength:

Theory Powering Applications, A Festschrift for Lawrence D. Brown, IMS Collections 6, 2010.

[3] L. Balzano and J. Ellenberg, Understanding Persistent Homology and Plex Using a Networking Dataset,

Technical Report, October 2010.

[4] P. Bendich, Analyzing Stratified Spaces Using Persistent Versions of Intersection and Local Homology, PhD

Dissertation, Department of Mathematics, Duke University, 2009.

[5] G. Carlsson, Topology and Data, Bulletin of the American Mathematical Society 46 (2009) 255-308.

[6] G. Carlsson and V. de Silva, Topological Estimation Using Witness Complexes, Eurographics Symposium on

Point-based Graphics, Zurich (2004) 157-166.

[7] H. Edelsbrunner, D. Letscher, and A. Zomorodian, Topological Persistence and Simplification, Discrete &

Computational Geometry 28 (2002) 511-533.

[8] R. Frigg, In What Sense is the Kolmogorov-Sinai Entropy a Measure for Chaotic Behaviour? - Bridging the

Gap Between Dynamical Systems Theory and Communication Theory, British J. for the Phil. of Science 55

(2004) 411-434.


[9] R. Ghrist, Barcodes: The Persistent Topology of Data, Bull. Amer. Math. Soc. 45 (2008) 61-75.

[10] A. Longjas, I. Crisologo, C. Monterola, and E. F. Legara, A Procedure for Generating a Random Number

Sequence Using at Least Seven Individuals, Proceedings Samahang Pisika ng Pilipinas (Proceedings of the

26th National Physics Congress of the Philippines), 2008.

[11] M. R. Muldoon, R. S. MacKay, J. P. Huke and D. S. Broomhead, Topology from Time Series, Physica D

65 (1993) 1-16.

[12] D. Sciamarella and G. B. Mindlin, Topological Structure of Chaotic Flows from Human Speech Chaotic

Data, Phys. Rev. Letters 82 (1999) 1450-1453.

[13] J. R. Munkres, Elements of Algebraic Topology, Prentice Hall, 1984.

[14] D. Mumford, A. Lee, and K. Pedersen, The Nonlinear Statistics of High-contrast Patches in Natural Images,

Intl. J. Computer Vision 54 (2003) 83-103.

[15] J. Reiss, The Analysis of Chaotic Time Series, Ph.D. Thesis, Georgia Inst. Tech., Atlanta, 2001.

[16] H. Sexton, M. Johansson, JPlex software package, URL http://comptop.stanford.edu/programs/jplex.

[17] K. Short, Direct Calculation of Entropy from Time Series, J. of Computational Physics 104(1) (1993) 162-172.

[18] M. Small, Applied Nonlinear Time Series Analysis: Applications in Physics, Physiology and Finance, Nonlinear Science Series A 52, World Scientific, 2005.

[19] M. Small, J. Zhang and X. Xu, Transforming Time Series into Complex Networks, Complex Sciences 2

(2009) 2078-2089.

[20] F. Takens, Detecting Strange Attractors in Fluid Turbulence, Symposium on Dynamical Systems and Turbulence, Springer Lecture Notes in Mathematics 898 (1981) 366-381.

[21] H. Tong, Non-Linear Time Series: A Dynamical System Approach, Oxford Statistical Science Series, Oxford: Clarendon Press, 1990.

[22] A. Zomorodian, G. Carlsson, Computing Persistent Homology, Discrete & Computational Geometry 33

(2005) 249-274.

[23] B. D. Fulcher, M. A. Little, and N. S. Jones, Highly Comparative Time-series Analysis: The Empirical Structure of Time Series and Their Methods, J. Roy. Soc. Interface 10(83) (2013) 20130048. DOI: 10.1098/rsif.2013.0048
