
Research on Large Scale Data Sets Categorization Based on SVM

Yongli Li 1, Liyan Dong 2, Minghui Sun 2, Hongjie Wang 3, Le Huang 2, Xinxin Wang 2, Meichen Dong 4

1. School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P.R. China; [email protected]
2. College of Computer Science and Technology, Jilin University, Changchun, 130012, P.R. China;
3. Changchun Railway Vehicles CO., LTD, Changchun, 130062, P.R. China; [email protected]
4. Economics School, Jilin University, Changchun, 130012, P.R. China; [email protected]

Corresponding Author: Liyan Dong, Email: [email protected]

ABSTRACT

Support Vector Machine (SVM) algorithms are not well suited to large data sets because of their high training complexity. To address this issue, this paper presents a two-stage SVM classification algorithm based on fuzzy clustering. In the first stage, an approximate decision hyperplane is obtained by a weighted SVM trained on the data produced by fuzzy clustering. In the second stage, the final decision hyperplane is obtained by an SVM trained on the data near the approximate hyperplane from the first stage. Experimental results demonstrate that our approach achieves good classification accuracy while training significantly faster than the standard SVM, which gives the improved approach a distinct advantage on huge data sets.

KEYWORDS

SVM, large data sets, Fuzzy clustering, category, classification

1 INTRODUCTION

SVM is well suited to learning from limited training data sets. Compared with other classification methods, it offers good learning ability and generalization performance. However, on large-scale training data sets the learning task consumes a great deal of time and memory [1], because SVM must solve a quadratic programming (QP) problem to find a separating hyperplane; with a general-purpose solver this takes roughly cubic time and quadratic space in M, the size of the training set. Designing a classification algorithm applicable to large-scale data sets has therefore become an important topic in SVM research.

Many efforts have been made on classification for large data sets. Chunking was the first decomposition method; it uses an iterative approach to remove non-support vectors gradually, thereby reducing the memory requirements of the training process [2]. Platt proposed the sequential minimal optimization (SMO) algorithm, which breaks the large QP problem into a series of smallest-possible QP problems and is faster than chunking [3]. Oommen et al. proposed an adaptive iteration strategy for large data sets; experiments show that this strategy can greatly reduce the computational complexity and can be widely applied to text classification [4]. Dong et al. introduced a parallel optimization step in which block-diagonal matrices are used to approximate the original kernel matrix, so that SVM classification can be split into hundreds of sub-problems [5].


Cortes and Vapnik [6] showed that the weights of the optimal classification hyperplane in feature space can be represented as a linear combination of the support vectors, which means the optimal hyperplane is independent of all training samples except the support vectors. If the support vectors can be found in advance, one can train the SVM on only that subset of samples rather than on the whole training set, which greatly reduces the time and space complexity of learning. Based on this observation, several pattern-selection methods for SVM have been reported. Zheng [7] proposed a fuzzy support vector machine pre-extraction method; Zhang [8] selected potential support vectors by computing the ratio of each sample's distances to the two classes; Shih [9] reduced the size of the training set by summarizing the information of all training samples; and Shin [10] selected the potential support vector set with an improved K-nearest-neighbor method.

Clustering is an effective tool for reducing data set size, for example hierarchical clustering [11], k-means clustering [12] and parallel clustering, and clustering-based methods can reduce the computational burden of SVM. CT-SVM [13] applies reduction techniques based on clustering analysis to find relevant support vectors and speed up training; the algorithm iteratively builds a hierarchical clustering tree for each class over several epochs. CB-SVM [14] applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples carrying statistical summaries of the data, chosen so that the summaries maximize the benefit of learning the SVM. The fuzzy clustering algorithm [15] is simple, easy to implement and efficient, so this paper uses fuzzy clustering as the clustering analysis method for reducing large-scale training sets.

Most clustering-based SVMs use only the cluster centers as the training set. This improves training speed, but because it discards many support vectors it greatly reduces the classification accuracy of the SVM. To address this shortcoming, this paper presents a two-stage SVM classification algorithm based on fuzzy clustering. Our approach achieves good classification accuracy while training significantly faster than the standard SVM, giving it distinct advantages on huge data sets.

2 PROBLEM STATEMENT AND PRELIMINARIES

SVM [16] discriminates between two classes by finding the optimal separating hyperplane that maximizes the margin; the samples located on the margin border are referred to as support vectors and are used to create the decision surface. Assume that a training set X is given as

X = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, i.e. X = {x_i, y_i}_{i=1}^{n},

where x_i ∈ R^d and y_i ∈ {+1, −1}. Training the SVM amounts to solving the following quadratic programming problem:

min_{w, b, ξ}  (1/2)‖w‖^2 + C ∑_{i=1}^{n} ξ_i
s.t.  y_i((w · x_i) + b) ≥ 1 − ξ_i,  i = 1, 2, ..., n        (1)

where the ξ_i ≥ 0 are slack variables that tolerate misclassifications, and C > 0 is a user-specified penalty parameter that controls the degree of punishment of the misclassified training instances; ∑_{i=1}^{n} ξ_i reflects the empirical risk. The corresponding dual problem is:

min_a  (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j y_i y_j (x_i · x_j) − ∑_{i=1}^{n} a_i
s.t.  ∑_{i=1}^{n} a_i y_i = 0,  0 ≤ a_i ≤ C,  i = 1, 2, ..., n        (2)

According to the Kuhn-Tucker theorem, the optimal solution satisfies

a_i ( y_i((w · x_i) + b) − 1 ) = 0,  i = 1, 2, ..., n        (3)

In the optimization process each training sample point has a corresponding a_i, but the samples with a_i = 0 have no effect on the classification result; only the sample points with a_i > 0 have an impact on the classification, and these points are called support vectors. The support vectors are able to describe the whole training data set, so separating the support vector set is equivalent to separating the entire data set. Therefore, if the support vectors could be found in advance, only they would need to be trained on rather than the entire training data set, which would greatly reduce the time and space complexity of learning. The closer a sample point lies to the optimal separating hyperplane, the greater the likelihood that it is a support vector; such sample points are called potential support vectors.

The final classification decision function is:

f(x) = sgn{ (w · x) + b } = sgn{ ∑_{i ∈ SV} a_i y_i (x_i · x) + b }        (4)

The category of a sample point x is determined by the value of f(x): if f(x) returns 1, the sample point is assigned to the positive class; otherwise it is assigned to the negative class.
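As a concrete illustration of this formulation, the following minimal sketch trains a soft-margin linear SVM on a small synthetic two-class data set and evaluates the sign of the decision function as in equation (4). It uses scikit-learn's SVC as a stand-in QP solver; this library choice and the synthetic data are assumptions of the illustration, not part of the paper.

```python
# Illustrative sketch only: soft-margin linear SVM on synthetic data,
# inspecting the support vectors and the decision function of Eq. (4).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two Gaussian blobs labeled +1 / -1
X_pos = rng.randn(50, 2) + [2.0, 2.0]
X_neg = rng.randn(50, 2) + [-2.0, -2.0]
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0)   # C is the penalty parameter of Eq. (1)
clf.fit(X, y)

# Only the support vectors (a_i > 0) define the hyperplane, as Eqs. (3)-(4) state.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("signs for two test points:",
      np.sign(clf.decision_function([[1.5, 1.5], [-1.0, -2.0]])))
```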

3 TWO-STAGE SVM ALGORITHM BASED ON FUZZY CLUSTERING

The idea of the two-stage SVM algorithm based on fuzzy clustering is as follows. We first obtain a number of clusters by fuzzy clustering and select only the cluster centers and the mixed clusters as the training data set. We then check whether this training data set contains enough samples: if it does, we use these samples to train the weighted SVM and the algorithm ends. If the selected samples are only a small part of the original training data, the fuzzy clustering may have discarded a large number of support vectors, which reduces the classification accuracy, so we apply a two-stage SVM.

Two-stage SVM: first we train the weighted SVM on the reduced samples and obtain an approximate optimal hyperplane; we then compute the distance of each sample point to this approximate hyperplane and remove the data whose distance exceeds a certain threshold, since such data are useless for the classification. Second, we add back the data of the clusters whose centers are support vectors, and use these de-clustered data together with the mixed clusters to train the second-phase SVM. In this way we achieve fast classification while preserving the classification accuracy.

The specific process of the algorithm is shown in Figure 1:


FIGURE 1. Two-stage SVM classification algorithm based on fuzzy clustering

A. Fuzzy Clustering

Clustering essentially deals with the task of splitting a set of patterns into a number of more-or-less homogeneous classes (clusters) with respect to a suitable similarity measure, such that the patterns belonging to any one cluster are similar and the patterns of different clusters are as dissimilar as possible.

Let us formulate the fuzzy clustering problem as follows: consider a finite set of elements to be clustered, X = {x_1, x_2, ..., x_n}, with dimension d in the real space R^d, i.e. x_j ∈ R^d, j = 1, 2, ..., n. Fuzzy clustering divides this sample set into k fuzzy subsets with respect to a given criterion, which is usually to find the cluster center of each cluster while minimizing an objective function built from distance measures. The result of the fuzzy clustering can be expressed by a partition matrix U = [u_{ij}], i = 1, ..., k, j = 1, ..., n, where each u_{ij} is a numeric value in [0, 1]. There are two constraints on the values of u_{ij}: first, the total membership of each element x_j ∈ X over all clusters equals 1; second, every constructed cluster is non-empty and different from the entire set:

∑_{i=1}^{k} u_{ij} = 1,  j = 1, 2, ..., n;        0 < ∑_{j=1}^{n} u_{ij} < n,  i = 1, 2, ..., k        (5)

The general form of the objective function of fuzzy clustering is:

J(U, c_1, ..., c_k) = ∑_{i=1}^{k} J_i = ∑_{i=1}^{k} ∑_{j=1}^{n} u_{ij}^m d_{ij}^2        (6)

where c_i is the cluster center of fuzzy cluster i, d_{ij} = ‖c_i − x_j‖ is the Euclidean distance between the i-th cluster center and the j-th sample point, and m ∈ [1, ∞) is an exponential weight which influences the degree of fuzziness of the membership function.

In order to bring equation (6) to a minimum, we construct the following new objective function:

J̄(U, c_1, ..., c_k, λ_1, ..., λ_n) = J(U, c_1, ..., c_k) + ∑_{j=1}^{n} λ_j ( ∑_{i=1}^{k} u_{ij} − 1 )
                                   = ∑_{i=1}^{k} ∑_{j=1}^{n} u_{ij}^m d_{ij}^2 + ∑_{j=1}^{n} λ_j ( ∑_{i=1}^{k} u_{ij} − 1 )        (7)

where λ_j (j = 1, 2, ..., n) are the Lagrange multipliers for the n constraints of formula (5). Setting the derivatives with respect to the input parameters to zero yields the necessary conditions for (6) to reach a minimum:

c_i = ( ∑_{j=1}^{n} u_{ij}^m x_j ) / ( ∑_{j=1}^{n} u_{ij}^m )        (8)

and

u_{ij} = 1 / ∑_{l=1}^{k} ( d_{ij} / d_{lj} )^{2/(m−1)}        (9)

Although these equations cannot be solved directly, fuzzy clustering can be handled by a simple iterative process, as follows:

Step 1: Select the number of clusters k (2 ≤ k ≤ n) and the exponential weight m (1 < m < ∞). Choose an initial partition matrix U^(0) and a termination criterion ε.
Step 2: Calculate the fuzzy cluster centers {v_i^(l) | i = 1, 2, ..., k} using U^(l) and (8).
Step 3: Calculate the new partition matrix U^(l+1) using {v_i^(l) | i = 1, 2, ..., k} and (9).
Step 4: Calculate Δ = ‖U^(l+1) − U^(l)‖ = max_{i,j} |u_{ij}^(l+1) − u_{ij}^(l)|. If Δ > ε, set l := l + 1 and go to Step 2; if Δ ≤ ε, stop.

The iterative procedure described above minimizes the objective function and converges to one of its local minima. Through these operations, fuzzy clustering divides the sample set into k fuzzy subsets. However, fuzzy clustering requires the optimal number of clusters to be predefined, which involves more computational cost than the clustering itself. In this paper we select the cluster centers and the data of mix-labeled clusters as training data for the first-phase SVM, because we believe these data are the most useful and representative points in a large data set for finding support vectors. Since our objective is only to select representative data from a large data set for training the SVM, we do not pursue the optimal number of clusters; instead we select a specific number k of cluster centers using a random algorithm.
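To make the iterative procedure concrete, the sketch below implements the update equations (8) and (9) with NumPy on a small random data set. The function and variable names (fuzzy_c_means, U, centers, eps) are our own illustrative choices, not the authors' implementation.

```python
# Minimal fuzzy c-means sketch following Steps 1-4 above (Eqs. (8) and (9)).
import numpy as np

def fuzzy_c_means(X, k, m=2.0, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    # Step 1: random initial partition matrix U (k x n), columns sum to 1
    U = rng.rand(k, n)
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centers, Eq. (8)
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3: new partition matrix, Eq. (9)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)  # k x n
        d = np.fmax(d, 1e-12)                      # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 4: termination test on the maximum change of the memberships
        converged = np.max(np.abs(U_new - U)) <= eps
        U = U_new
        if converged:
            break
    return centers, U

X = np.random.RandomState(1).randn(200, 2)
centers, U = fuzzy_c_means(X, k=4)
print("cluster centers:\n", centers)
```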

B. Two-Phase SVM Classifier

The classification task is divided into two phases in the two-stage SVM. The first-phase SVM classifier obtains an approximate decision hyperplane; the data near this hyperplane are then de-clustered and used as training data for the second-phase SVM, from which a more precise classification can be obtained. The following subsections give a detailed explanation of each step.

Data selection

Let the training data set be X = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where the samples belong to the real space R^d, i.e. x_j ∈ R^d, y_j ∈ {1, −1} is the class label, x_j = (x_{j1}, x_{j2}, ..., x_{jd})^T ∈ R^d, and each component x_{ji} is a characteristic (attribute, dimension or variable). The fuzzy clustering process finds k partitions of X, C = {C_1, C_2, ..., C_k} (k < n), such that a) ∪_{i=1}^{k} C_i = X and b) C_i ≠ ∅, i = 1, ..., k, where each cluster C_i may contain samples with labels y_j ∈ {1, −1}.

After the fuzzy clustering, the obtained clusters can be classified into two types:

1) clusters containing only positively labeled data or only negatively labeled data, called unified clusters, denoted by C_u;

2) clusters containing both positively and negatively labeled data (mix-labeled), called mixed clusters, denoted by C_m.

Figure 2 illustrates the clusters produced by fuzzy clustering: the clusters of circle points are positively labeled and the clusters of box points are negatively labeled. By this definition, clusters A and B are mixed clusters and the other clusters are unified clusters.

FIGURE 2. Clustering the original data sets; cluster A and B are mixed clusters

The samples in the mixed clusters are likely to become support vectors and are therefore called potential support vectors. Consequently we select not only the centers of the unified clusters but also all of the samples in the mixed clusters; this selected data set is used in the first-phase SVM classification. Let the selected data be X_1 = {(x_1, y_1), (x_2, y_2), ..., (x_{n_1}, y_{n_1})}, where n_1 is the number of selected samples. If the selected data already contain a sufficient number of samples, the second-phase SVM is not needed, which reduces the training time. Thus if n_1 / n ≥ φ, where φ is a threshold (set to 0.01 in this paper), we use only the first-phase weighted SVM classification and the algorithm ends. Otherwise, the second-phase SVM classification is required.
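A minimal sketch of this data-selection step is given below, assuming the fuzzy memberships have already been hardened to one cluster index per sample; the hard-assignment step and all helper names are our assumptions for illustration, not part of the paper's procedure.

```python
# Illustrative data selection: unified-cluster centers + all mixed-cluster samples.
import numpy as np

def select_first_phase_data(X, y, centers, U, phi=0.01):
    labels = np.argmax(U, axis=0)                 # hard cluster index per sample
    sel_X, sel_y = [], []
    mixed_mask = np.zeros(len(y), dtype=bool)
    for i in range(centers.shape[0]):
        idx = np.where(labels == i)[0]
        if idx.size == 0:
            continue
        if np.unique(y[idx]).size == 1:           # unified cluster: keep only its center
            sel_X.append(centers[i])
            sel_y.append(y[idx][0])               # label shared by the whole cluster
        else:                                     # mixed cluster: keep every sample
            sel_X.append(X[idx])
            sel_y.append(y[idx])
            mixed_mask[idx] = True
    X1 = np.vstack([np.atleast_2d(a) for a in sel_X])
    y1 = np.hstack([np.atleast_1d(a) for a in sel_y])
    enough = X1.shape[0] / X.shape[0] >= phi      # test n1 / n >= phi from the text
    return X1, y1, mixed_mask, enough
```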

The first-phase weighted SVM classification

Before training the SVM on the selected data, we must take into account that the training samples do not all play the same role in classification. The unified clusters contain different numbers of samples, so the chosen cluster centers represent data sets of different sizes: the center of a unified cluster that contains many samples is more important than the center of one that contains few, and these unified cluster centers are more important than the individual samples of the mixed clusters. If these data were treated identically during SVM training, the similarity between the approximate hyperplane and the final optimal hyperplane would be greatly reduced, which would affect the selection of the potential support vectors. In this paper, in order to bring the approximate hyperplane as close as possible to the optimal hyperplane, we modify the standard SVM to reflect the importance of the different data. The modified quadratic form of the SVM optimization problem is as follows:

min_{w, b, ξ}  (1/2) w^T w + C ∑_{i=1}^{n} λ_i ξ_i
s.t.  y_i((w · ϕ(x_i)) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0        (10)

The improved quadratic form introduces a weight on the classification error of each sample. The weight value λ_i of a unified cluster center is defined as:

λ_i = ( |C_i| / Uniform_Number ) · k_u        (11)

where |C_i| is the number of samples contained in the i-th unified cluster, Uniform_Number is the total number of samples contained in all unified clusters, and k_u is the number of unified clusters. The weight value λ_i of a sample in a mixed cluster is defined as:

λ_i = ( |C_m| / Mixed_Number ) · k_m        (12)

where sample i belongs to the mixed cluster C_m, |C_m| is the number of samples that the mixed cluster contains, Mixed_Number is the total number of samples contained in all mixed clusters, and k_m is the number of mixed clusters. By this definition of the weight value, the more samples a cluster contains, the greater the weight assigned to the corresponding unified cluster center.

The dual problem corresponding to (10) is:

min_a  (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j y_i y_j (x_i · x_j) − ∑_{i=1}^{n} a_i
s.t.  ∑_{i=1}^{n} a_i y_i = 0,  0 ≤ a_i ≤ λ_i C        (13)

Compared with the standard SVM, the dual problem of the improved SVM adds a multiplier to the upper bound of each Lagrange multiplier. After this simple modification, any algorithm for solving the standard SVM problem can be applied to the weighted SVM. By solving the above dual problem we obtain an approximate hyperplane that is close to the optimal hyperplane. The obtained decision hyperplane may not be precise enough, but it at least gives us a reference on the data distribution and provides a good guide for the further determination of the support vectors. Figure 3 shows the first phase of the weighted SVM, whose training data set consists of the samples in the mixed clusters and the centers of the unified clusters.

FIGURE 3. The first-phase SVM, trained using the data in the mixed clusters and the centers of the unified clusters.
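In practice the per-sample weights λ_i of equations (11) and (12) can be passed to an off-the-shelf solver as sample weights, which rescales the misclassification penalty of each point in the same way the modified upper bound 0 ≤ a_i ≤ λ_i C in (13) does. The sketch below uses scikit-learn's sample_weight parameter for this purpose; the library, the toy data and the weight values are our illustrative assumptions, not the authors' implementation.

```python
# Sketch of the first-phase weighted SVM: cluster centers and mixed-cluster samples
# trained with per-sample weights lambda_i (Eqs. (11)-(12)). Passing them as
# sample_weight scales each point's penalty C, mirroring 0 <= a_i <= lambda_i * C.
import numpy as np
from sklearn.svm import SVC

def train_weighted_svm(X1, y1, lambdas, C=1.0):
    clf = SVC(kernel="linear", C=C)
    clf.fit(X1, y1, sample_weight=lambdas)        # weighted soft-margin SVM
    return clf

# Toy usage with hypothetical weights: centers of large unified clusters weigh more.
X1 = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5], [0.2, -0.1], [-0.1, 0.3]])
y1 = np.array([1, 1, -1, -1, 1, -1])
lambdas = np.array([3.0, 2.0, 3.0, 2.0, 0.5, 0.5])   # hypothetical lambda_i values
h_approx = train_weighted_svm(X1, y1, lambdas)
print("approximate hyperplane: w =", h_approx.coef_[0], "b =", h_approx.intercept_[0])
```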

The second-phase SVM classification

Note that the original data set is reduced significantly by data selection, and the training set of the first-phase SVM classification is only a small percentage of the original. This may affect the classification precision, i.e., the obtained decision hyperplane may not be precise enough. To address this shortcoming, we should perform the classification using as much useful data as possible. A natural idea is therefore to recover the data that are near the support vector clusters and use the recovered data to train the second-phase SVM. By the sparseness property of SVM, data that are not support vectors do not contribute to the optimal hyperplane, so input data that are far away from the decision hyperplane should be eliminated, while data that are possible support vectors should be used. According to the above analysis, we make the following modifications to the training data set of the first-phase weighted SVM.

1) Remove the data for which the distance between the unified cluster center and the approximate decision hyperplane exceeds a certain threshold, because they will not contribute to finding the support vectors.

After the first-phase weighted SVM we obtain an approximate decision hyperplane h, and we then delete the centers of the unified clusters that are non-support vectors. Let L_s denote the classification margin and L_j the distance between the center O_j of a cluster C_j and h. If L_j satisfies the condition

L_j − L_s ≤ δ        (14)

then C_j is a support vector cluster, where δ is the distance threshold and satisfies 0 < δ < 1.

Condition (14) states that a support vector cluster is one whose center is near the approximate hyperplane h, while a non-support-vector cluster is one whose center is far from h. Figure 4 shows that the centers of clusters A and C belong to support vector clusters, while the centers of B and D belong to non-support-vector clusters.


FIGURE 4. Support vector clusters

h is similar to the optimal decision hyperplane h'; data that are close to h are also close to h'. Therefore, the samples that belong to the support vector clusters are likely to become support vectors of the optimal decision hyperplane; meanwhile, the centers of the non-support-vector clusters are removed.

2) Keep the data of the mixed clusters, since they are more likely to be support vectors.

3) Additionally, recover the data contained in the support vector clusters; we call this process de-clustering. In this way, more potential support vectors near the approximate decision hyperplane can be found.

Figure 5 shows the training data set used by the second SVM. Taking the recovered data as the new training data set, we apply SVM classification again to obtain the final optimal decision hyperplane:

f(x) = sgn( ∑_{i=1}^{n'} a_i y_i K(x_i, x) + b )        (15)

where n' is the size of the training data set after recovery.

FIGURE 5. Recovering the data in the unified clusters close to the approximate hyperplane
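The de-clustering and second-phase training can be sketched as follows: keep the clusters whose centers lie within δ of the margin band around the approximate hyperplane h, recover their members, add the mixed-cluster samples, and retrain. The sketch assumes a linear first-phase model so that distances to h can be computed from (w, b); all function and variable names here are our own illustrative choices.

```python
# Sketch of the second-phase data recovery and retraining (Eqs. (14)-(15)).
import numpy as np
from sklearn.svm import SVC

def second_phase(X, y, labels, centers, mixed_mask, h_approx, delta=0.5, C=1.0):
    w, b = h_approx.coef_[0], h_approx.intercept_[0]
    w_norm = np.linalg.norm(w)
    L_s = 1.0 / w_norm                                  # half-width of the margin band
    keep = np.zeros(len(y), dtype=bool)
    for i, c in enumerate(centers):
        L_j = abs(np.dot(w, c) + b) / w_norm            # distance of center O_j to h
        if L_j - L_s <= delta:                          # Eq. (14): support vector cluster
            keep |= (labels == i)                       # de-cluster: recover its samples
    keep |= mixed_mask                                  # always keep mixed-cluster data
    clf = SVC(kernel="rbf", C=C)                        # final SVM, Eq. (15) with kernel K
    clf.fit(X[keep], y[keep])
    return clf

# Usage (with labels = hard cluster assignments from the clustering step):
# final_model = second_phase(X, y, labels, centers, mixed_mask, h_approx)
```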

4 ALGORITHM DESCRIPTION AND TIME COMPLEXITY ANALYSIS

A. Algorithm Description

The two-stage SVM classification algorithm based on fuzzy clustering is described as follows:

Input: training data set D, the distance threshold δ


Output: the final optimal decision hyperplane

1) (C_u ∪ C_m, D_center) := FCMCluster(D, K); // fuzzy clustering divides the training data set into k fuzzy subsets; C_u are the unified clusters, C_m the mixed clusters, and D_center the centers of the unified clusters
2) D_initial := C_m ∪ D_center; // use the data in the mixed clusters and the unified cluster centers as the training data set
3) h' := weight_SVM.train(D_initial); // train the weighted SVM on D_initial and return the approximate decision hyperplane h'
4) if (|D_initial| / |D| >= 0.01) // |D_initial| is the size of the selected training data set, |D| the size of the original data set
5) {  return h';  }
6) else
7) {
8)    D'_center := getLowerMarginCluster(D_center, h', δ); // get the centers of the unified clusters close to the approximate decision hyperplane h', where D'_center ⊆ D_center
9)    for i := 1 to |D'_center|
10)   {
11)     D'_i := getClusterData(C_i); // recover the data of the unified cluster whose center is close to the approximate decision hyperplane
12)     D'_reduced := D'_reduced ∪ D'_i;
13)   }
14)   D_reduced := D'_reduced ∪ C_m; // use the data in the mixed clusters and the support vector clusters as the training data set for the second SVM
15)   h := SVM.train(D_reduced); // obtain the final optimal decision hyperplane h using the SVM
16)   return h;
17) }

B. Time Complexity Analysis

The training time of the approach proposed in this paper consists of two parts: the fuzzy clustering algorithm and the two-stage SVM classification algorithm. The time complexity of fuzzy clustering is O(iter × k × n), where k is the number of clusters and iter is the maximum number of iterations. According to reference [16], the major computational cost of training an SVM on n samples is O(n^{2.2}). Therefore, in the first weighted SVM the time complexity is O((l + m)^{2.2}), where l is the number of unified cluster centers and m is the number of samples contained in the mixed clusters; in the second SVM the time complexity is O((l_1 + m)^{2.2}), where l_1 is the number of samples contained in the support vector clusters. The total time complexity is therefore O(iter × k × n) + O((l + l_1 + m)^{2.2}). For a large data set, n ≫ l + l_1 + m, iter < 100 and k ≪ n, so the training time of the standard SVM is much longer than that of our approach.
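As a rough, purely illustrative calculation with assumed sizes (these are not figures from the experiments): if the original training set has n = 10^5 samples and the clustering step reduces the data seen by the SVMs to about 10^4 points, the super-linear SVM term alone shrinks by roughly two orders of magnitude,

\[
\frac{n^{2.2}}{(10^{4})^{2.2}} = \left(\frac{10^{5}}{10^{4}}\right)^{2.2} = 10^{2.2} \approx 158,
\]

while the added clustering cost O(iter × k × n) grows only linearly in n.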

5 EXPERIMENTAL RESULTS

The experimental data set is a Chinese text corpus provided by Fudan University. We randomly select some texts from six representative categories as the training data set, and then randomly select some texts in each category as the testing data set. The distribution of the data set over the categories is shown in Table 1:

TABLE 1. Distribution of the data sets

Category             Art    Environment   Agriculture   Economy   Politics   Sports
Training data set    500    600           500           750       500        600
Testing data set     250    300           250           350       250        300

We use this data set to train the standard SVM and the improved two-stage SVM classification algorithm based on fuzzy clustering, and compare the performance of the two classifiers. Table 2 gives the overall assessment of the classification performance of the SVM and the improved SVM, and Table 3 shows the training time comparison.

TABLE 2. Overall assessment of the classification performance of the SVM and the improved SVM when the size of the training data set is 3450

Overall assessment   Macro average recall rate   Macro average precision   Micro average recall rate   Micro average precision
SVM                  95.39%                      95.28%                    95.29%                      95.29%
Improved SVM         95.37%                      95.20%                    95.24%                      95.24%

TABLE 3. Training time comparison of the SVM and the improved SVM when the size of the training data set is 3450

Classifier type   Training data sets   Category number   Time of clustering (s)   Time of SVM (s)   Total time (s)
SVM               3450                 6                 0                        56                56
Improved SVM      3450                 6                 15                       37                52

Tables 2 and 3 show that our improved classifier has almost the same accuracy as the standard SVM, declining only slightly: the decline in micro average precision is 95.29% − 95.24% = 0.05%. At the same time, the training time of our improved classifier is reduced by 56 − 52 = 4 s compared with the SVM.

In text classification, the size of the corpus has a certain impact on the accuracy and efficiency of the algorithm. Therefore, we increased the training and testing data sets. Table 4 shows the distribution of the increased data sets, Table 5 shows the overall performance assessment of the two classification algorithms, and Table 6 shows the training time comparison.

TABLE 4. Distribution of the increased data sets

Category             Art     Environment   Agriculture   Economy   Politics   Sports
Training data set    1000    1200          1000          1600      1000       1200
Testing data set     500     600           500           800       500        600

TABLE 5. Overall assessment of the classification performance of the SVM and the improved SVM when the size of the training data set is 7000

Overall assessment   Macro average recall rate   Macro average precision   Micro average recall rate   Micro average precision
SVM                  95.95%                      95.84%                    95.85%                      95.85%
Improved SVM         96.05%                      95.91%                    95.95%                      95.95%

TABLE 6. Training time comparison of the SVM and the improved SVM when the size of the training data set is 7000

Classifier type   Training data sets   Category number   Time of clustering (s)   Time of SVM (s)   Total time (s)
SVM               7000                 6                 0                        185               185
Improved SVM      7000                 6                 31                       123               154

Tables 5 and 6 show that the classification accuracy of our improved classifier rises slightly compared with the SVM: the increase in micro average precision is 95.95% − 95.85% = 0.10%. The experimental results show that with the larger data set the improved SVM achieves higher classification accuracy, and the training-time advantage is more significant than on the small data set; the time saved is 185 − 154 = 31 s.

To compare the training time of the two classification algorithms more intuitively, we measured the training time of both algorithms with training set sizes of 3000, 4000, 5000, 6000 and 7000; the experimental results are shown in Figure 6.


FIGURE 6. The impact of the size of the training data sets on the training time (x-axis: size of the training data set, 3000 to 7000; y-axis: training time in seconds; series: SVM training time and improved SVM training time)

Figure 6 shows that the training time of the improved SVM is always lower than that of the standard SVM, and that as the training data set grows the SVM's training time increases at a higher rate than that of the improved SVM. We can therefore expect that the larger the training data set, the more pronounced the training-time advantage of the improved SVM becomes.

The experimental results show that on large data sets the two-stage SVM classification algorithm based on fuzzy clustering achieves almost the same accuracy as the standard SVM with a much shorter training time, and the larger the training data set, the greater the advantage. When the data set is very small, however, the classification accuracy is reduced considerably.

6 CONCLUSIONS

For large data sets, this paper proposed a two-stage SVM classification algorithm based on fuzzy clustering. We first use fuzzy clustering to remove part of the non-support vectors from the training data set and train the first-phase weighted SVM on the reduced data. We then determine whether the algorithm can terminate according to a given criterion; if not, we modify the training data set of the first-phase weighted SVM and train the second-phase SVM. Experimental results demonstrate that our approach achieves good classification accuracy while training significantly faster than the standard SVM, giving the improved approach distinct advantages on huge data sets.

ACKNOWLEDGMENT

This study has been partially supported by the NSFC of China (No. 61300145), the China Postdoctoral Science Foundation (2014M561294) and the Natural Science Foundation for Young Scholars of Jilin Province, China (20150520065JH).

REFERENCES

[1] Hyunjung Shin, Sungzoon Cho. Invariance of neighborhood relation under input space to feature space mapping, Pattern Recognition Letters, vol. 26, no. 1, pp. 707-718, 2005.

[2] Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik. A training algorithm for optimal margin classifiers, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, New York, USA, pp. 144-152, 1992.

[3] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA: MIT Press, pp. 1-21, 1999.

[4] S. W. Kim, B. J. Oommen. Enhancing prototype reduction schemes with recursion: a method applicable for "large" data sets, IEEE Transactions on Systems, vol. 34, no. 3, pp. 1384-1397, 2004.

[5] Jian-xiong Dong, Adam Krzyzak, Ching Y. Suen. Fast SVM training algorithm with decomposition on very large data sets, Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 603-618, 2005.

[6] M. B. Almeida, A. Braga, J. P. Braga. SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means, In Proceedings of the 6th Brazilian Symposium on Neural Networks, Brazil, pp. 162-167, 2000.

[7] Chun-Hong Zheng, Li-cheng Jiao. Fuzzy pre-extracting method for support vector machine, Proceedings of the First International Conference on Machine Learning and Cybernetics, Beijing, pp. 2026-2030, 2002.

[8] Zhang Li, Zhou Weida, Jiao Licheng. Pre-extracting support vectors for support vector machine, Signal Processing Proceedings, vol. 3, pp. 1432-1435, 2000.

[9] Lawrence Shih, Jason D. M. Rennie, Yu-Han Chang. Text bundling: statistic-based data reduction, In Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003.

[10] Hyunjung Shin, Sungzoon Cho. Fast pattern selection for support vector classifiers, Lecture Notes in Computer Science, vol. 2637, no. 567, pp. 376-387, 2003.

[11] Rui Xu, Donald Wunsch II. Survey of clustering algorithms, Neural Networks, vol. 16, no. 3, pp. 645-678, 2005.

[12] Jair Cervantes, Xiaoou Li, Wen Yu. SVM classification based on fuzzy clustering for large data sets, Lecture Notes in Computer Science, vol. 4293, pp. 572-582, 2006.

[13] L. Khan, M. Awad, B. Thuraisingham. A new intrusion detection system using support vector machines and hierarchical clustering, The VLDB Journal, vol. 16, no. 4, pp. 507-521, 2007.

[14] Hwanjo Yu, Jiong Yang, Jiawei Han. Classifying large data sets using SVMs with hierarchical clusters, In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, pp. 306-315, 2003.

[15] N. Pal, J. Bezdek. On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 370-379, 1995.

[16] Vapnik V. The Nature of Statistical Learning Theory (2nd edition), New York: Springer Press, pp. 421-440, 2000.
