
GESTS Int’l Trans. Computer Science and Engr., Vol.20, No.1 29

GESTS-Oct.2005

Using Cumulative Weighted Slopes for Clustering Time Series Data

Durga Toshniwal1 and R. C. Joshi1

1 Department of Electronics & Computer Engineering, Indian Institute of Technology Roorkee,

Roorkee – 247 667 Uttaranchal, India

{durgadec, joshifcc}@iitr.ernet.in

Abstract. Clustering time series data is an important mining activity in various domains. In this paper we propose a novel approach for clustering time series data based on cumulative weighted slopes. This technique is based on the observation that similar time sequences would have similar slopes at their corresponding points. The weighted sum of these slopes is called the cumulative weighted slope and is computed for each time sequence. Clusters are formed on the basis of this weighted sum of slopes to identify similar patterns over periods of time.

1 Introduction

Mining time series data for clusters has been an area of active research for the last few decades. It has immense applications in various domains. Examples of some application domains include finance and banking, retail sales, weather forecasting and agriculture. The problem of clustering is interdisciplinary in nature and has been addressed in different contexts by researchers working in a variety of areas such as data mining, statistics, and information systems.

Much work is being done on clustering time series data [1], [2], [3], [4], [5]. Clustering can be used as a stand-alone tool for analyzing data, and may also be used as a pre-processing step in other data mining algorithms [6], [7]. Clustering methods can be classified in a number of ways [8], [9].

Clustering can also be broadly categorized as whole sequence clustering and subsequence clustering [10]. Whole sequence clustering groups similar time series into the same cluster, whereas subsequence clustering uses a sliding window to extract subsequences from a given time series and then clusters them. Due to the increased interest in streaming time series data, most of the work on clustering time series data is based on subsequence clustering [1], [2], [3], [4], [5], [6], [7], [10], [11]. Keogh et al. claim in [10] that clustering streaming time series data is completely meaningless.

In this paper, we suggest a novel approach for clustering time series data which is based on whole sequence clustering. In our method, the feature extraction from time series data is done using cumulative weighted slopes. The cumulative weighted slope is defined as the sum of the weighted slopes of a given time sequence computed on a point-to-point basis. The parameters representing the cumulative weighted slopes for various time sequences are then grouped into clusters using the k-means clustering method to identify similar patterns. In this paper, we assume that a time series consists of a sequence of real numbers which represent the values of a measured parameter at equal but finite intervals of time. We first demonstrate the effectiveness of our approach by applying it to synthetic time series data consisting of a variety of similar and reverse shaped curves. Next, we demonstrate its application on real life case data on retail sales. This data was collected on a monthly basis over a period of eleven years from retail chain stores in the USA. Retail sales data has been chosen as the case data due to the growing importance of time series data mining for the retail industry. Clustering the retail sales data reveals similarity in the buying patterns of some common retail items. Retailers can use such information to enhance their sales by designing effective marketing strategies, optimizing inventory and using shelf space efficiently.

The rest of the paper is organized as follows. Section 2 briefly gives background and related work. In Section 3, we describe the proposed approach. Section 4 gives some experimental results and in Section 5 we discuss the case study on real life retail time series data. Finally the conclusions and future work are covered in Section 6.

2 Background and Related Work

For clustering time series data, we need to perform feature extraction from the time series data and then apply some clustering technique to the feature vector. In this section, we briefly discuss some key approaches for performing clustering and feature extraction from time series data.

2.1 Feature Extraction from Time Series Data

There has been an explosion of interest in feature extraction from time series data [12], [13], [14]. So far, a variety of approaches have been suggested for deriving a feature vector from time series data. Most of these techniques rely on dimension reduction for mapping the high dimensional time series data to a lower dimensional space. The transformed data is then used for efficient indexing and retrieval.

Agrawal et al. [12] used the Discrete Fourier Transform (DFT) for deriving the feature vector from the time series data. The DFT was used to map the time sequences to the frequency domain. Chan et al. [13] proposed using the Discrete Wavelet Transform (DWT) in place of the DFT for feature extraction from time series data. Unlike the DFT, which misses the time localization of sequences, the DWT allows time as well as frequency localization concurrently. A data dependent scheme for feature extraction was proposed in [14] and is known as the Singular Value Decomposition (SVD) method.


In this paper, we introduce a new technique for feature extraction from time series data using cumulative weighted slopes. By cumulative slopes we mean the summation of slopes of the time sequences at corresponding points. Further, these slopes are assigned weights depending on their locations along the time axis. This helps to exaggerate the similarity (or dissimilarity) of trends. The proposed approach works well in the presence of variable length time sequences in the database. It can handle time as well as amplitude scaling and different baselines for the time sequences in the given database.

2.2 Clustering

Clustering is the grouping of unlabeled data such that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters. So far, much work has been done on the different approaches for performing clustering [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. One of the most widely used clustering methods is the hierarchical clustering method [4]. However, its application is limited to relatively small datasets [9] because its time complexity is $O(n^2 \log n)$, where n is the number of tuples or objects in the given database.

The k-means clustering method is more suitable for large datasets [9]. It was first introduced by MacQueen in [15]. This clustering algorithm falls in the category of partitioning approach for clustering data.

The k-means algorithm partitions a set of n data objects into k clusters so that the resulting intracluster similarity is high whereas the intercluster similarity is low. The number of clusters k has to be fixed a priori. Initially, the k-means algorithm randomly selects k of the data objects as the cluster means or centers. Each remaining object is assigned to the cluster whose mean is nearest to it. The algorithm then computes the new mean for each cluster. The process iterates until the criterion function converges. The criterion function is defined as:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2 \tag{1}$$

Here E is the sum of the squared error for all objects in the database, p is the point in space representing a given object, and $m_i$ is the mean of the cluster $C_i$.
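As a rough illustration of the criterion function in (1), the following sketch evaluates E for scalar data, which is the case relevant to this paper. The function name `squared_error` and the sample numbers are illustrative assumptions, not part of the original method description.

```python
def squared_error(clusters):
    """Criterion function E of Eq. (1) for one-dimensional points.

    `clusters` is a list of clusters, each a non-empty list of floats.
    """
    total = 0.0
    for members in clusters:
        mean = sum(members) / len(members)  # m_i, the mean of cluster C_i
        total += sum((p - mean) ** 2 for p in members)  # sum of |p - m_i|^2
    return total


# Two ways of splitting the same four (made-up) points into k = 2 clusters.
tight = squared_error([[1.0, 2.0, 3.0], [10.0]])
loose = squared_error([[1.0, 2.0], [3.0, 10.0]])
print(tight, loose)
```

The first split groups the three nearby points together and therefore gives the smaller value of E, which is the behaviour k-means tries to reach.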

In our paper, we have chosen the k-means clustering algorithm for analyzing the time series data because its time complexity is almost linear even for large datasets. The complexity of the k-means algorithm for N objects is O(kNrD) [10], where k is the number of clusters specified by the user, r is the number of iterations until convergence, and D is the dimensionality of the time series. To reduce the complexity, we can reduce N or D. It may not always be possible to reduce N, so the best way to reduce the complexity is to reduce D by using an efficient representation technique for the time series data.

In our approach, each time sequence is represented as a single number, obtained as the weighted sum of the slopes of the time sequence at certain points. As a result, the dimensionality of the time series data is reduced to a constant equal to 1, and for all practical purposes the complexity of the k-means algorithm is reduced to O(kNr). This is a substantial improvement in the complexity of the k-means algorithm.


3 Proposed Approach

In this paper, we suggest a simple and novel approach for clustering time sequences which is based on whole sequence clustering. The cumulative weighted slopes are used for feature extraction from the given time sequences. First, the slopes are calculated at corresponding points of each time sequence under observation. The slopes computed at corresponding points of the sequences are then assigned weights depending on the location of the slope along the time axis. The weighted slopes obtained for each time sequence are then summed to give the cumulative weighted slope for that time sequence.

In this way, the cumulative weighted slope is computed for all the time sequences being studied. These cumulative weighted slopes are then grouped into clusters using the k-means clustering method to identify similar patterns.

3.1 Feature Extraction Using Cumulative Weighted Slopes

In this section, we introduce the parameters for the cumulative weighted slope. For the computation of the cumulative weighted slope, all the time sequences are divided into the same number of small strips of equal width along the time axis. In our approach, the weight given to the slope at a point is the ratio of the strip number (for which the slope is being computed) to the total number of strips into which the time sequence has been divided.

Cumulative Weighted Slope Computation. The approach requires some data pre-processing steps to be performed prior to slope computations. It is assumed here that the time series database consists of p time sequences designated by $X_1, X_2, \ldots, X_p$. Each time sequence $X_i$ in turn can be represented as $\langle (t_{i1}, y_{i1}), (t_{i2}, y_{i2}), \ldots, (t_{in}, y_{in}) \rangle$.

The first step in data pre-processing involves scaling each time sequence $X_i$ in the time series database along the time axis, so that all time axes are equalized to some desired value $t_d$. The selection of $t_d$ is done by the user and may depend on the application domain of the data. Scaling along the time axis makes it possible to compare variable length time sequences. For example, a 5-year growth pattern of a Company A can be compared to a 10-year growth pattern of a Company B. In order to avoid any distortions that may arise due to this scaling along the time axis, the values along the y-axis for each $X_i$ are also scaled proportionately. Each transformed $X_i$, denoted by $X_i'$, may be represented as $\langle (t_{i1}', y_{i1}'), (t_{i2}', y_{i2}'), \ldots, (t_{in}', y_{in}') \rangle$ where:

$$t_{ik}' = t_{ik} \cdot (t_d / t_{in}) \quad \text{and} \quad y_{ik}' = y_{ik} \cdot (t_d / t_{in}) \tag{2}$$

This is followed by dividing each time sequence in the database into the same number of small, equi-width strips along the time axis, as shown in Fig. 1. Thus each time sequence is divided into m strips. The strips have different heights but the same width along the time axis.
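The two pre-processing steps can be sketched as follows. The scaling follows (2) directly. How the y-values at the strip boundaries are obtained is not spelled out in the text, so the linear interpolation used in `strip_boundaries` is an assumption, as are the function names and the sample sequence.

```python
def scale_sequence(t, y, t_d):
    """Apply Eq. (2): scale the time and value axes by t_d / t_n."""
    factor = t_d / t[-1]  # t[-1] is t_in, the last time stamp
    return [tk * factor for tk in t], [yk * factor for yk in y]


def strip_boundaries(t, y, m):
    """Sample y at the m + 1 boundaries of m equal-width strips.

    Assumption: values between samples are linearly interpolated, and the
    strips span the interval from t[0] to t[-1].
    """
    width = (t[-1] - t[0]) / m  # strip width, delta t
    values = []
    j = 0  # index of the segment [t[j], t[j + 1]] containing the boundary
    for k in range(m + 1):
        tb = t[0] + k * width
        while j < len(t) - 2 and t[j + 1] < tb:
            j += 1
        frac = (tb - t[j]) / (t[j + 1] - t[j])
        values.append(y[j] + frac * (y[j + 1] - y[j]))
    return values, width


# A made-up sequence of length 10 rescaled to t_d = 5 and cut into 5 strips.
t_scaled, y_scaled = scale_sequence([0, 2, 4, 6, 8, 10], [0, 4, 8, 12, 16, 20], t_d=5)
boundary_values, delta_t = strip_boundaries(t_scaled, y_scaled, m=5)
print(t_scaled, boundary_values, delta_t)
```

Because both axes are multiplied by the same factor, the slope of the straight line in this example is unchanged by the scaling.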


The first parameter introduced in this section to represent the cumulative weighted slope is denoted by $WS_{sq}(X_i)$ and is given as:

$$WS_{sq}(X_i) = \sum_{k=1}^{m} S_{ik}^2 \left(\frac{k}{m}\right)^2 \tag{3}$$

Here $S_{ik}$ is the slope for the kth strip in the time sequence $X_i$, and m is the total number of strips into which the time sequence has been divided. The slope $S_{ik}$ is given as:

$$S_{ik} = \frac{y''_{i(k+1)} - y''_{ik}}{\Delta t} \tag{4}$$

We assume in (4) that the starting and ending coordinates for the kth strip of the time sequence being analyzed are given by $(t'_{ik}, y''_{ik})$ and $(t'_{i(k+1)}, y''_{i(k+1)})$, and that Δt is the width of each strip, which is a constant. The choice of Δt may be user specified or domain specific. Its value should be chosen so that it is neither too small (which may lead to excessive computation) nor too large (which loses detail).

The weight associated with the slope $S_{ik}$ as in (4) is given by (k / m), where k is the strip number for which the slope $S_{ik}$ is being computed and m is the total number of strips into which each time sequence is divided.
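A minimal sketch of (3) and (4) is given below. It assumes the y-values at the strip boundaries are already available as a list; the function names and the sample numbers are hypothetical.

```python
def strip_slopes(boundary_values, delta_t):
    """Eq. (4): slope of each strip from consecutive boundary values."""
    return [
        (boundary_values[k + 1] - boundary_values[k]) / delta_t
        for k in range(len(boundary_values) - 1)
    ]


def ws_sq(slopes):
    """Eq. (3): sum over strips k = 1..m of S_k^2 * (k / m)^2."""
    m = len(slopes)
    return sum((s ** 2) * (k / m) ** 2 for k, s in enumerate(slopes, start=1))


# Made-up sequence: flat for two strips, then rising with slope 2.
slopes = strip_slopes([0.0, 0.0, 0.0, 2.0, 4.0], delta_t=1.0)
late_rise = ws_sq(slopes)

# The same slopes placed in the first two strips instead of the last two.
early_rise = ws_sq([2.0, 2.0, 0.0, 0.0])
print(slopes, late_rise, early_rise)
```

The two sequences contain the same slopes, but the one that rises late receives the larger value because the weight (k / m) grows with the strip number.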

The next parameter introduced to represent the cumulative weighted slope is denoted by $WS_{cube}(X_i)$ and is given as:

$$WS_{cube}(X_i) = \sum_{k=1}^{m} S_{ik}^3 \left(\frac{k}{m}\right)^3 \tag{5}$$

Here $S_{ik}$ is the slope for the kth strip in the time sequence $X_i$, m is the total number of strips into which the time sequence has been divided, and the slope $S_{ik}$ is given as in (4). The weight assigned to the slope $S_{ik}$ in (5) is $(k / m)^3$, where k is the strip number for which the slope $S_{ik}$ is being computed and m is the total number of strips.

The cube of the slopes has been specially chosen in (5) to retain the positive or negative sign of the weighted slopes for a given time sequence. We feel that the inclusion of the sign plays a significant role in computing the cumulative weighted slope for a time sequence. Moreover, the cube of (k / m) has been used in conjunction with the cube of $S_{ik}$ in the parameter $WS_{cube}(X_i)$. This exaggerates the role of the location of the strip (i.e. k) for which the slope $S_{ik}$ is being computed. In other words, the weights in (5) emphasize the fact that a certain slope in a time sequence exists at a certain point. Summing these weighted slopes over all strips of the given time sequence gives the parameter in (5).
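The effect of the sign can be seen in a small sketch of (5) next to (3). The function names and the two slope lists are illustrative assumptions.

```python
def ws_sq(slopes):
    """Eq. (3): squared weighted slopes, so the sign of each slope is lost."""
    m = len(slopes)
    return sum((s ** 2) * (k / m) ** 2 for k, s in enumerate(slopes, start=1))


def ws_cube(slopes):
    """Eq. (5): cubed weighted slopes, so the sign of each slope is kept."""
    m = len(slopes)
    return sum((s ** 3) * (k / m) ** 3 for k, s in enumerate(slopes, start=1))


rising = [1.0, 2.0]     # made-up slopes of an increasing sequence
falling = [-1.0, -2.0]  # the mirror-image decreasing sequence

print(ws_sq(rising), ws_sq(falling))      # identical values
print(ws_cube(rising), ws_cube(falling))  # equal magnitude, opposite sign
```

In this example the squared parameter cannot tell the rising sequence from the falling one, whereas the cubed parameter separates them by sign.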

Clustering Time Series Data. We have used the k-means clustering algorithm to group the cumulative weighted slope parameters. The algorithm is outlined in Table 1. The iterations stop when the criterion function as in (1) converges. The overall strategy of the proposed method is summarized as follows:

• Data pre-processing
Step 1: Scaling of data along the time axis and correspondingly scaling the values of the y-ordinate to avoid any possibility of data distortions.
Step 2: Dividing each time sequence into the same number of equi-width strips.

• Feature extraction using cumulative weighted slopes
Step 3: Computing the parameter $WS_{sq}(X_i)$ or the parameter $WS_{cube}(X_i)$ to arrive at the cumulative weighted slopes of the time sequences being analyzed.

• Clustering
Step 4: Clustering of the parameters obtained in Step 3 using the k-means clustering algorithm.

[Figure: a time sequence plotted as y against time t, with ordinates y1, y2, …, yn at times t1, t2, …, tn]

Fig. 1. Division of the normalized time series into n equi-width strips each having width Δt

4 Experimental Results

To demonstrate the effectiveness of our approach, we have conducted experiments with synthetic time series datasets. The synthetic datasets used in this section have been specifically designed to illustrate the feature extraction method suggested in our approach. A variety of shapes and reverse shapes have been used for our experiments, but due to lack of space, only small subsets of these are shown here. The application of k-means clustering to the real life case data is dealt with in the next section.

The first sample dataset A is shown in Fig. 2. It comprises A1, A2, A3 and A4. The data are pre-processed as discussed in Section 3. This involves scaling along the x-axis and correspondingly along the y-axis, taking $t_d$ = 5 (this value can be user defined or domain specific). Each member of dataset A has been divided into 10 strips (this number can also be user defined or domain specific). The pre-processed dataset A is denoted by AS and is shown in Fig. 3.

The parameters $WS_{sq}(X)$ and $WS_{cube}(X)$ computed for the pre-processed dataset AS are given in Table 2. Clustering has been done using the k-means algorithm as explained in Section 2 with k = 2. The resulting clusters on the basis of parameter $WS_{sq}(X)$ are shown in Table 3. Applying k-means clustering (k = 2) to the parameter $WS_{cube}(X)$ yields the same clusters as shown in Table 3. In terms of dataset A, the first cluster consists of A1, A2 and A3 and the second one comprises A4.

The next sample dataset under consideration is B and is shown in Fig. 4. The data has been pre-processed as discussed in Section 3 taking $t_d$ = 5. Each member of dataset B has been divided into 10 strips. The pre-processed dataset B is denoted by BS and comprises B1S, B2S, B3S and B4S. The parameters $WS_{sq}(X)$ and $WS_{cube}(X)$ computed for the pre-processed dataset BS are given in Table 4. Clustering is done using the k-means algorithm as explained in Section 3 with k = 2.

Table 1. K-Means Algorithm

| S.No. | Steps |
|---|---|
| 1. | Choose the value of k |
| 2. | Randomly select k objects as the cluster centers |
| 3. | Assign each object to the cluster to which it is most similar (nearest), depending on the mean value for that cluster |
| 4. | Re-calculate the k cluster centers |
| 5. | Repeat steps 3 and 4 until the cluster centers stop moving |
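The steps of Table 1 can be sketched for scalar data as follows. This is a minimal illustration rather than the authors' implementation: the parameters `initial_centers` and `seed`, and the handling of empty clusters, are assumptions added so that the example is repeatable. The sample values are the $WS_{sq}$ entries of Table 2.

```python
import random


def k_means_1d(values, k, initial_centers=None, seed=0):
    """Cluster scalar values by following the steps of Table 1.

    Returns (labels, centers). `initial_centers` and `seed` are illustrative
    parameters, not part of the original description.
    """
    # Steps 1-2: k is given; pick k objects as the initial cluster centers.
    if initial_centers is None:
        initial_centers = random.Random(seed).sample(values, k)
    centers = list(initial_centers)
    while True:
        # Step 3: assign each object to the nearest cluster center.
        labels = [
            min(range(k), key=lambda i: abs(v - centers[i])) for v in values
        ]
        # Step 4: re-calculate each center as the mean of its members.
        new_centers = []
        for i in range(k):
            members = [v for v, lab in zip(values, labels) if lab == i]
            # An empty cluster keeps its previous center (an assumption).
            new_centers.append(sum(members) / len(members) if members else centers[i])
        # Step 5: stop once the centers no longer move.
        if new_centers == centers:
            return labels, centers
        centers = new_centers


# WS_sq values of A1S..A4S from Table 2, with two of them as starting centers.
ws_sq_values = [1.320, 1.288, 1.195, 3.548]
labels, centers = k_means_1d(ws_sq_values, k=2, initial_centers=[1.320, 1.288])
print(labels, centers)
```

With these starting centers the sketch places A1S, A2S and A3S in one cluster and A4S in the other, which agrees with Table 3.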

Fig. 2. Time series dataset A

Fig. 3. Finally pre-processed time series dataset A denoted by AS


Table 2. Cumulative weighted slope computations for Dataset AS

| Pre-processed Sequence | Parameter $WS_{sq}(X)$ | Parameter $WS_{cube}(X)$ |
|---|---|---|
| A1S | 1.320 | -0.467 |
| A2S | 1.288 | -0.497 |
| A3S | 1.195 | -0.745 |
| A4S | 3.548 | -1.320 |

Table 3. Results of K-Means clustering applied to Table 2 (k = 2)

| Cluster No. | Description |
|---|---|
| 1. | A1S, A2S, A3S |
| 2. | A4S |

Fig. 4. Time series dataset B

Table 4. Cumulative weighted slope computations for Dataset BS

| Pre-processed Sequence | Parameter $WS_{sq}(X)$ | Parameter $WS_{cube}(X)$ |
|---|---|---|
| B1S | 1.919 | -1.120 |
| B2S | 1.843 | -0.942 |
| B3S | 1.827 | -1.037 |
| B4S | 3.526 | -2.085 |

Table 5. Results of K-Means clustering applied to Table 4 (k = 2)

| Cluster No. | Description |
|---|---|
| 1. | B1S, B2S, B3S |
| 2. | B4S |

The resulting clusters on the basis of parameter $WS_{sq}(X)$ are shown in Table 5. Applying k-means clustering (k = 2) to the parameter $WS_{cube}(X)$ yields the same clusters as shown in Table 5. In terms of dataset B, the first cluster consists of B1, B2 and B3 and the second one comprises B4.

5 Case Study

The case study undertaken in this paper consists of similarity analysis of retail sales data (in millions of dollars) collected on a monthly basis over a period of 11 years (from 01/1992 to 12/2002) for chain retail stores in the USA [16]. Each time sequence in the retail sales time series database consists of 132 data points (for each item on sale). We considered sales data of several types of retail businesses, as listed in Table 6.

The time series data from the retail industry has been studied to analyze the sales patterns of different categories of products. Clustering the parameters for cumulative weighted slopes representing the retail sales time sequences can help identify the products which show similar sales patterns. This information can serve as an important tool for retailers in leveraging their sales, designing effective marketing strategies, using their shelf space efficiently, forecasting inventory requirements and so on.

Table 6. Businesses considered in the case study

| S. No. | Description | S. No. | Description |
|---|---|---|---|
| 1. | Health and Personal Care Stores | 7. | Men’s Clothing Stores |
| 2. | Pharmacies and Drug Stores | 8. | Women’s Clothing Stores |
| 3. | Furniture Stores | 9. | Shoe Stores |
| 4. | Jewelry Stores | 10. | New Car Dealers |
| 5. | Sporting Goods, Hobby and Music Stores | 11. | Used Car Dealers |
| 6. | Household Appliances Stores | | |

Table 7. Cumulative weighted slope computations for the case study

| S. No. | Description | Parameter $WS_{sq}(X)$ |
|---|---|---|
| 1. | Health and Personal Care Stores | 8729.19 |
| 2. | Pharmacies and Drug Stores | 7517.66 |
| 3. | Sporting Goods, Hobby and Music Stores | 17739.05 |
| 4. | Furniture Stores | 5523.15 |
| 5. | Used Car Dealers | 2926.08 |
| 6. | Jewelry Stores | 12322.13 |
| 7. | Women’s Clothing Stores | 5664.91 |
| 8. | Shoe Stores | 3323.99 |
| 9. | Household Appliances | 895.42 |
| 10. | Men’s Clothing Stores | 2229.54 |
| 11. | New Car Dealers | 37761.62 |


Table 8. Results of k-means clustering with k = 4

| Cluster No. | Description |
|---|---|
| 1. | Men’s Clothing Stores, Shoe Stores, Used Car Dealers, and Household Appliances Stores |
| 2. | Furniture Stores, Women's Clothing Stores, Health and Personal Care Stores, and Pharmacies and Drug Stores |
| 3. | Jewelry Stores and Sporting Goods, Hobby and Music Stores |
| 4. | New Car Dealers |

The first step involves data pre-processing. The data has been pre-processed using the steps outlined in Section 3. Thereafter, cumulative weighted slopes given by parameter $WS_{sq}(X)$ have been computed for each of the retail sales time sequences (each having 132 data points) as per the procedure described in Section 3. The results are summarized in Table 7. After the cumulative weighted slope computations, the k-means clustering algorithm has been employed to find groups of products having similar customer buying patterns. The results of clustering with k = 4 are listed in Table 8. Products whose cumulative weighted slope parameters lie in the same cluster exhibit similar sales patterns. The parameter $WS_{cube}(X)$ can be computed and clustered in the same way.
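The grouping in Table 8 can be checked against the values in Table 7 with a short sketch. Note that this is not the iterative k-means procedure used in the paper: because the feature is a single number, the partition that minimizes the criterion in (1) must consist of contiguous runs of the sorted values, so the sketch simply tries every possible set of cut positions. The function names are hypothetical.

```python
from itertools import combinations

# WS_sq values copied from Table 7.
ws_sq = {
    "Health and Personal Care Stores": 8729.19,
    "Pharmacies and Drug Stores": 7517.66,
    "Sporting Goods, Hobby and Music Stores": 17739.05,
    "Furniture Stores": 5523.15,
    "Used Car Dealers": 2926.08,
    "Jewelry Stores": 12322.13,
    "Women's Clothing Stores": 5664.91,
    "Shoe Stores": 3323.99,
    "Household Appliances": 895.42,
    "Men's Clothing Stores": 2229.54,
    "New Car Dealers": 37761.62,
}


def squared_error(values):
    """Contribution of one cluster to the criterion function (1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_partition(named_values, k):
    """Return the k groups of names that minimize criterion (1).

    Exhaustive search over every choice of k - 1 cut positions in the
    sorted values (a swapped-in exact method, not Lloyd-style k-means).
    """
    items = sorted(named_values.items(), key=lambda item: item[1])
    n = len(items)
    best_cost, best_groups = None, None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        groups = [items[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(squared_error([v for _, v in g]) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return [[name for name, _ in g] for g in best_groups]


clusters = best_partition(ws_sq, k=4)
for number, names in enumerate(clusters, start=1):
    print(number, names)
```

For k = 4 the minimum-error partition of the Table 7 values coincides with the four clusters listed in Table 8.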

It can be concluded from Table 8 that the retail sales for the 11-year period from January 1992 to December 2002 at men's clothing stores, shoe stores, household appliances stores and used car dealers show similar sales patterns. This suggests that stores selling men's clothes may also have a shoe section, as these items show similar sales patterns. Alternatively, stores selling men's clothes and shoes may place the latter in shelf spaces near the clothes section. This provides the customer with convenience and at the same time may help boost sales of these items. Similarly, the information obtained by applying our approach to sales time series data from retail stores may be applied to various other items of business to derive significant business strategies and rules.

6 Conclusions and Future Work

We have proposed a new and efficient technique for clustering time series data. It is based on whole sequence clustering. The proposed approach works by computing parameters representing cumulative weighted slopes for the time sequences under observation. These parameters are then clustered using the k-means clustering algorithm. In this paper, we assume that a time series consists of a sequence of real numbers which represent the values of a measured parameter at equal intervals of time. The proposed approach works irrespective of global scaling or shrinking of the time sequences. It is also capable of handling different baselines.

The case data considered in this study is 11 years of sales data collected from retail chain stores in the USA on a monthly basis from January 1992 to December 2002. Applying the proposed approach to the sales time series data from the retail industry can reveal similarities in the sales patterns of different items. This information may be very helpful for boosting sales and designing effective marketing strategies in the traditional retail industry as well as in e-business.

In further work we intend to obtain association rules from retail time series data by searching for groups of clusters that frequently occur together. Also, alternative clustering algorithms may be employed for grouping the retail time series data. We also intend to employ a further enlarged dataset which may include sales data for many more types of businesses and places.

References

[1] G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth, "Rule Discovery from Time Series," Proc. of the 4th Int'l Conference on Knowledge Discovery and Data Mining, pp. 16-22, New York, NY, Aug 27-31, 1998.

[2] P. Cotofrei and K. Stoffel, "Classification Rules + Time = Temporal Rules," Proc. of the 2002 Int'l Conference on Computational Science, pp. 572-581, Amsterdam, Netherlands, Apr 21-24, 2002.

[3] X. Jin, L. Wang, Y. Lu, and C. Shi, "Indexing and Mining of the Local Patterns in Sequence Databases," Proc. of the 3rd Int'l Conference on Intelligent Data Engineering and Automated Learning, pp. 68-73, Manchester, UK, Aug 12-14, 2002.

[4] E. Keogh and S. Kasetty, "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration," Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp. 102-111, Edmonton, Alberta, Canada, July 23-26, 2002.

[5] N. Radhakrishnan, J. D. Wilson, and P. C. Loizou, "An Alternate Partitioning Technique to Quantify the Regularity of Complex Time Series," Int'l Journal of Bifurcation and Chaos, Vol. 10, No. 7, pp. 1773-1779, World Scientific Publishing, 2000.

[6] S. K. Harms, J. Deogun, and T. Tadesse, "Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences," Proc. of the 13th Int'l Symposium on Methodologies for Intelligent Systems, pp. 432-441, Lyon, France, June 27-29, 2002.

[7] C. Li, P. S. Yu, and V. Castelli, "MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels," Proc. of the 7th ACM CIKM Int'l Conference on Information and Knowledge Management, pp. 267-272, Bethesda, MD, Nov 3-7, 1998.

[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2002.

[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Survey," ACM Comput. Surv., Vol. 31, pp. 264-323, 1999.

[10] E. Keogh, J. Lin, and W. Truppel, "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research," Proc. of the Int'l Conference on Data Mining, pp. – 115, 2003.


[11] P. Cotofrei, "Statistical Temporal Rules," Proc. of the 15th Conference on Computational Statistics, Berlin, Germany, Aug 24-28, 2002.

[12] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," Proc. 4th Int'l Conf. Foundations of Data Organization and Algorithms, pp. 69-84, Chicago, Illinois, USA, 1993.

[13] D. Rafiei, "On Similarity Based Queries for Time Series Data," Proc. 15th IEEE Int'l Conf. Data Engineering, pp. 410-417, Sydney, Australia, March 1999.

[14] F. Korn, H. Jagadish, and C. Faloutsos, "Efficiently Supporting Ad hoc Queries in Large Datasets of Time Sequences," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 289-300, Tucson, AZ, May 1997.

[15] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. of the 5th Berkeley Symposium Math. Statist. Prob., pp. 281-297, 1967.

[16] Economic Time Series Page, http://www.economagic.com

Biography

▲ Name: Durga Toshniwal
Address: Department of Electronics & Computer Engineering, Indian Institute of Technology, Roorkee, India – 247 667
Education & Work experience: The author holds a Bachelor of Engineering from JMI, India and a Master of Technology from NIT Kurukshetra, India. Presently, she is a Research Scholar at the Indian Institute of Technology Roorkee, India. Her areas of research interest are Time Series Data Mining and KDD.
Tel: +91-1332-271575

▲ Name: R. C. Joshi
Address: Department of Electronics & Computer Engineering, Indian Institute of Technology, Roorkee, India – 247 667
Education & Work experience: The author holds a B.E. from Allahabad University, India, and an M.E. and Ph.D. from IIT Roorkee, India. Presently, he is a Professor at the Indian Institute of Technology Roorkee, India. His area of interest is Databases.
Tel: +91-1332-285650