Fundamenta Informaticae 34 (1998) 1-16
IOS Press

PATTERN EXTRACTION FROM DATA

Sinh Hoa Nguyen, Hung Son Nguyen
Institute of Mathematics, Warsaw University
Banacha 2, Warsaw 02097, Poland
E-mail: [email protected], [email protected]

Abstract. Searching for patterns is one of the main goals in data mining. Patterns have important applications in many KDD domains such as rule extraction and classification. In this paper we present some methods of rule extraction that generalize the existing approaches to the pattern problem. These methods, called partition (grouping) of attribute values, can be applied to decision tables with symbolic value attributes. If data tables contain both symbolic and numeric attributes, some of the proposed methods can be used jointly with discretization methods. Moreover, these methods are applicable to incomplete data. The optimization problems for grouping of attribute values are either NP-complete or NP-hard; hence we propose some heuristics returning approximate solutions for these problems.

1. Introduction

We consider decision tables containing objects represented by vectors of attribute values. Formally, attributes are defined as functions from the set of objects into a corresponding set of values (domains). We distinguish two types of attributes, called continuous (numeric) and nominal (symbolic), with respect to their domains. We also assume that every object belongs to one of the decision classes. The classification problem is defined as the problem of classifying new, unseen objects into the proper decision classes using accessible information about objects from a training set.

There are many classification methods, e.g. AQ [10], C4.5, ID3 [16], which are based on two main approaches, called decision trees and pattern extraction. All decision tree methods are based on searching for classifiers labeling internal nodes. Usually, classifiers are descriptors (i.e. pairs of attributes and values), hence this approach is more suitable for nominal attributes than for continuous ones, because the ramification number of a node is equal to the number of possible values of the attribute domain.


In the case of continuous attributes, either some discretization process must be applied or another type of classifier used; for example, one can use a simple test of the form "if the attribute value is less than a given value" to create a binary internal node. The pattern-based classification methods consist in looking for descriptions of decision classes by logical formulas, and they also require some discretization process for continuous attributes. In general, every discretization method aims at removing superfluous attribute information by unifying values within some intervals while preserving the necessary information (e.g. discernibility between objects). Hence finding suitable interval boundaries is the main goal of all discretization methods. The following question arises: "What can we do in the case of symbolic attributes with a large number of possible values?" One of the possible solutions is to partition such attribute domains into smaller ones.

In previous papers [11, 13, 14] we presented discretization methods based on rough set theory and the Boolean reasoning approach, as well as methods for pattern extraction when patterns are conjunctions of simple descriptors. This paper focuses on the symbolic value partition problem, which is a generalization of both the discretization and the pattern extraction problems. To solve this problem we adopt well-known techniques based on Boolean reasoning and decision trees. Moreover, the computational complexity of some optimal partition problems is studied and we show that they are either NP-complete or NP-hard. Therefore, we propose some heuristics and evaluate their accuracy on different data sets.

2. Preliminaries

An information system [15] is a pair A = (U, A), where U is a non-empty, finite set called the universe and A is a non-empty, finite set of attributes, i.e. a : U → V_a for a ∈ A, where V_a is called the value set of a. Elements of U are called objects.

Any information system of the form A = (U, A ∪ {d}) is a decision table, where d ∉ A is called the decision and the elements of A are called conditions.

Let V_d = {1, ..., r(d)}. The decision d determines the partition {C_1, ..., C_{r(d)}} of the universe U, where C_k = {x ∈ U : d(x) = k} for 1 ≤ k ≤ r(d). The set C_k is called the k-th decision class of A.

With any subset of attributes B ⊆ A, an equivalence relation called the B-indiscernibility relation [15], denoted by IND(B), is defined by

IND(B) = {(x, y) ∈ U × U : ∀ a ∈ B (a(x) = a(y))}.

Objects x, y satisfying the relation IND(B) are indiscernible by the attributes from B. By [x]_{IND(B)} we denote the equivalence class of IND(B) defined by x. Any minimal subset B of A such that IND(A) = IND(B) is called a reduct of A.

Let A = (U, A ∪ {d}) be a decision table and B ⊆ A. We define a function ∂_B : U → 2^{{1,...,r(d)}}, called the generalized decision in A, by ∂_B(x) = d([x]_{IND(B)}).

A decision table A is called consistent (deterministic) if card(∂_A(x)) = 1 for any x ∈ U; otherwise A is inconsistent (non-deterministic).
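
To make these notions concrete, the following is a minimal Python sketch (not from the paper) that computes the IND(B)-classes and the generalized decision, assuming a decision table is stored as a list of dictionaries mapping attribute names to values, with the decision kept under the key "d"; this representation is an assumption made here for illustration only.

from collections import defaultdict

def ind_classes(objects, B):
    """Group object indices into IND(B)-equivalence classes."""
    classes = defaultdict(list)
    for i, u in enumerate(objects):
        # objects with identical values on all attributes of B are indiscernible
        classes[tuple(u[a] for a in B)].append(i)
    return list(classes.values())

def generalized_decision(objects, B, d="d"):
    """For every object, the set of decision values met in its IND(B)-class."""
    delta = {}
    for cls in ind_classes(objects, B):
        decisions = frozenset(objects[i][d] for i in cls)
        for i in cls:
            delta[i] = decisions
    return delta

def is_consistent(objects, A, d="d"):
    """A table is consistent iff card(generalized decision) = 1 for every object."""
    return all(len(v) == 1 for v in generalized_decision(objects, A, d).values())

For instance, with objects = [{"a": 1, "d": 0}, {"a": 1, "d": 1}], is_consistent(objects, ["a"]) returns False, because the two objects are indiscernible by "a" but carry different decisions.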


3. Searching for patterns in data

We mentioned in Section 1 that discretization of real value attributes is a necessary preprocessing step for almost all classification methods. In this section we recall some discretization techniques which will later be adopted to solve partition problems for symbolic value attributes. We give special consideration to the Boolean reasoning approach, for both the discretization problem and the partition problem. Moreover, we recall the problem of searching for templates in data [14] to show that partition methods also produce generalized templates. In this way the partition problem becomes a common generalization of both the discretization and the template extraction problems.

3.1. Discretization

Let A = (U, A ∪ {d}) be a decision table, where A = {a_1, ..., a_k} and d : U → {1, ..., r}. Any pair (a, c), where a ∈ A and c ∈ ℝ, will be called a cut on V_a. Any set of cuts

P_a = {(a, c^a_1), (a, c^a_2), ..., (a, c^a_{k_a})} ⊆ A × ℝ

on V_a = [l_a, r_a) ⊂ ℝ (for a ∈ A) uniquely defines a partition of V_a into subintervals, i.e.

V_a = [c^a_0, c^a_1) ∪ [c^a_1, c^a_2) ∪ ... ∪ [c^a_{k_a}, c^a_{k_a+1})

where l_a = c^a_0 < c^a_1 < c^a_2 < ... < c^a_{k_a} < c^a_{k_a+1} = r_a. Therefore, any set of cuts P = ∪_{a ∈ A} P_a defines a new decision table A_P = (U, A_P ∪ {d}), called the P-discretization of A, where A_P = {a_P : a ∈ A} and a_P(x) = i ⇔ a(x) ∈ [c^a_i, c^a_{i+1}) for x ∈ U and i ∈ {0, ..., k_a}.

Two sets of cuts P', P are equivalent, written P' ≡_A P, iff A_P = A_{P'}. The equivalence relation ≡_A has finitely many equivalence classes; in the sequel we will not discern between equivalent sets of cuts. The set of cuts P is A-consistent if ∂_A = ∂_{A_P}, where ∂_A and ∂_{A_P} are the generalized decisions of A and A_P, respectively. An A-consistent set of cuts P_irr is A-irreducible if no proper subset P ⊊ P_irr is A-consistent. An A-consistent set of cuts P_opt is A-optimal if card(P_opt) ≤ card(P) for any A-consistent set of cuts P.
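
The mapping from a set of cuts to the discretized table is mechanical; the following is a minimal Python sketch of the P-discretization, again assuming the list-of-dictionaries representation introduced above (an assumption of this presentation, not of the paper):

import bisect

def discretize(objects, cuts):
    """P-discretization: replace each numeric value by the index of the
    interval [c_i, c_{i+1}) it falls into.

    `cuts` maps an attribute name to a sorted list of cut points [c_1, ..., c_k];
    the outer boundaries l_a and r_a are implicit.
    """
    result = []
    for u in objects:
        v = dict(u)
        for a, cs in cuts.items():
            # a_P(x) = i  <=>  a(x) in [c_i, c_{i+1})
            v[a] = bisect.bisect_right(cs, u[a])
        result.append(v)
    return result

For example, discretize([{"a": 0.3}, {"a": 1.5}, {"a": 3.0}], {"a": [1.0, 2.5]}) assigns the interval indices 0, 1 and 2 to the three objects.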


Any cut (a, c) splits the set of values (l_a, r_a) of the attribute a into two intervals I_1 = (l_a, c) and I_2 = (c, r_a). For a fixed cut (a, c) we use the following notation:

r      = the number of decision classes of A;
A_{ij} = the set of objects of the j-th class in the i-th interval;
n_{ij} = the number of objects in A_{ij};
R_i    = the number of objects in the i-th interval, for i ∈ {1, 2} (i.e. R_i = Σ_{j=1}^{r} n_{ij});
c_j    = Σ_{i=1}^{2} n_{ij} = |C_j| = the cardinality of the j-th decision class (see Section 2);
n      = the total number of objects (i.e. n = Σ_{i=1}^{2} R_i = Σ_{j=1}^{r} c_j);
E_{ij} = the expected frequency of A_{ij} (i.e. E_{ij} = R_i · c_j / n);

where i ∈ {1, 2}, j ∈ {1, ..., r}.

3.1.1. Statistical test methods

Statistical tests allow us to check the probabilistic independence between the object partition defined by the decision attribute and the partition defined by the cut (a, c). The independence degree is estimated by the χ² test:

χ² = Σ_{i=1}^{2} Σ_{j=1}^{r} (n_{ij} − E_{ij})² / E_{ij}

Intuitively, if the partition defined by c does not depend on the partition defined by the decision d, then

P(C_j) = P(C_j | I_1) = P(C_j | I_2)     (1)

for any j ∈ {1, ..., r}. Condition (1) is equivalent to n_{ij} = E_{ij} for any i ∈ {1, 2} and j ∈ {1, ..., r}; hence we have χ² = 0. When a cut c "properly" separates objects from different decision classes, the value of the χ² test for c is maximal.

Discretization methods based on the χ² test choose only the cuts with large values of this test (and delete the cuts with small values of the χ² test). Some versions of this method can be found in [9].

3.1.2. Entropy methods

A number of methods based on the entropy measure have been developed for the discretization problem. They use the class entropy as a criterion to extract a list of the best cuts which, together with the attribute domain, induce the desired intervals.

The class information entropy of the partition induced by a cut point c on attribute a is defined by

E(a, c; U) = (|U_1| / n) · Ent(U_1) + (|U_2| / n) · Ent(U_2)

where U_1, U_2 are the sets of objects on the two sides (left and right) of the cut c and the function Ent is defined by

Ent(U_i) = − Σ_{j=1}^{r} (n_{ij} / R_i) · log(n_{ij} / R_i)     for i = 1, 2.

For a given feature a, a cut c_min minimizing the entropy function over all possible cuts is selected. This method can be applied recursively to both object sets U_1, U_2 induced by c_min until some stopping condition is reached.
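
Both measures are straightforward to compute from the 2 × r contingency table of a cut. The following is a minimal Python sketch (an illustration added here, not the authors' implementation), where the counts n_{ij} are given as a 2 × r matrix n[i][j]:

import math

def chi_square(n):
    """Chi-square statistic of a cut from its 2 x r count matrix n[i][j]."""
    R = [sum(row) for row in n]                       # interval sizes R_1, R_2
    r = len(n[0])
    c = [n[0][j] + n[1][j] for j in range(r)]         # class sizes c_j
    total = sum(R)
    chi2 = 0.0
    for i in range(2):
        for j in range(r):
            E = R[i] * c[j] / total                   # expected frequency E_ij
            if E > 0:
                chi2 += (n[i][j] - E) ** 2 / E
    return chi2

def ent(counts):
    """Ent(U_i) = - sum_j (n_ij / R_i) * log2(n_ij / R_i)."""
    R = sum(counts)
    return -sum((x / R) * math.log2(x / R) for x in counts if x > 0) if R else 0.0

def class_info_entropy(n):
    """E(a, c; U) = |U_1|/n * Ent(U_1) + |U_2|/n * Ent(U_2)."""
    total = sum(map(sum, n))
    return sum(sum(row) / total * ent(row) for row in n)

A cut that separates the two classes perfectly, e.g. n = [[5, 0], [0, 7]], gives class_info_entropy(n) = 0.0 and chi_square(n) = 12.0, the largest value attainable for a table of 12 objects, in agreement with the discussion above.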


There is a number of discretization methods based on information entropy (see e.g. [2, 6, 3, 16]). As an example we mention a method using the Minimal Description Length Principle to determine a stopping criterion for its recursive discretization strategy (see [6]). First, the information gain of the cut (a, c_min) over the set of objects U is defined by

Gain(a, c_min; U) = Ent(U) − E(a, c_min; U)

and the recursive partitioning within a set of objects U stops iff Gain(a, c_min; U) is too small to continue the splitting process, i.e.

Gain(a, c; U) < log_2(n − 1) / n + Δ(a, c; U) / n

where

Δ(a, c; U) = log_2(3^r − 2) − [r · Ent(U) − r_1 · Ent(U_1) − r_2 · Ent(U_2)]

and r, r_1, r_2 are the numbers of decision classes in U, U_1, U_2, respectively.

3.1.3. Boolean reasoning method

In [11, 12] we have presented a discretization method based on rough set theory and Boolean reasoning. The main idea of this method is based on the construction and analysis of a new decision table A* = (U*, A* ∪ {d*}), where

- U* = {(u, v) ∈ U² : d(u) ≠ d(v)} ∪ {⊥};
- A* = {c : c is a cut on A}, where c(⊥) = 0, c((u, v)) = 1 if c discerns u, v and c((u, v)) = 0 otherwise;
- d*(⊥) = 0 and d*((u, v)) = 1 for (u, v) ∈ U*.

It has been shown [11] that any relative reduct of A* is an irreducible set of cuts for A, and any minimal relative reduct of A* is an optimal set of cuts for A. The Boolean function f_{A*} corresponding to the minimal relative reduct problem has O(nk) variables (cuts) and O(n²) clauses. Hence:

Theorem 3.1. [11] The decision problem of checking, for a given decision table A and an integer k, whether there exists an irreducible set of cuts P in A such that card(P) < k is NP-complete. The problem of searching for an optimal set of cuts P of a given decision table A is NP-hard.

The discernibility degree of a cut (a, c) is defined as the number of pairs of objects from different decision classes (i.e. the number of objects of the table A*) discerned by c. This number can be computed by

Disc(a, c) = Σ_{i ≠ j} n_{1i} · n_{2j} = R_1 · R_2 − Σ_{i=1}^{r} n_{1i} · n_{2i}

The discretization algorithm called the MD-heuristic (an abbreviation of "Maximal Discernibility heuristic") searches for a cut (a, c) ∈ A* with the largest discernibility degree Disc(a, c). The cut c is then moved from A* to the resulting set of cuts P, and all pairs of objects discerned by c are removed from U*. The algorithm runs until U* = {⊥}. It has been shown in [12] that the MD-heuristic is very efficient, because it determines the best cut in O(kn) steps using only O(kn) space.
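
A compact Python sketch of the greedy MD-heuristic follows. It is only an illustration of the selection loop: candidate cuts are taken as midpoints between consecutive attribute values, and the discernibility counts are recomputed by a simple quadratic scan rather than by the O(kn) counting scheme of [12]; the table representation (a list of dictionaries with decision key "d") is an assumption of this sketch.

from itertools import combinations

def md_heuristic(objects, numeric_attrs, d="d"):
    """Greedily pick cuts with the largest discernibility degree until every
    pair of objects from different decision classes is discerned."""
    # candidate cuts: midpoints between consecutive distinct values
    candidates = []
    for a in numeric_attrs:
        vals = sorted({u[a] for u in objects})
        candidates += [(a, (x + y) / 2) for x, y in zip(vals, vals[1:])]

    def discerns(a, c, i, j):
        return min(objects[i][a], objects[j][a]) < c <= max(objects[i][a], objects[j][a])

    pairs = {(i, j) for i, j in combinations(range(len(objects)), 2)
             if objects[i][d] != objects[j][d]}
    chosen = []
    while pairs:
        # Disc(a, c) restricted to the still-undiscerned pairs
        best = max(candidates,
                   key=lambda ac: sum(discerns(ac[0], ac[1], i, j) for i, j in pairs),
                   default=None)
        if best is None or not any(discerns(best[0], best[1], i, j) for i, j in pairs):
            break   # remaining pairs cannot be discerned by any cut (inconsistent table)
        chosen.append(best)
        pairs = {(i, j) for i, j in pairs if not discerns(best[0], best[1], i, j)}
    return chosen

The returned list of (attribute, cut) pairs approximates an optimal set of cuts in the sense of Theorem 3.1.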


Objects   a1   a2   a3   a4     a5      d
x1        5    1    0    1.16   black   1
x2        4    0    0    8.33   red     0
x3        5    1    0    3.13   red     1
x4        5    0    0    3.22   black   1
x5        1    0    1    3.24   red     0

Table 1. An example of a template with fitness 2 and length 3.

3.2. Searching for patterns in data

A template T of A is any propositional formula ⋀ p_i, where the p_i are propositional variables, called descriptors, of the form a_i = v, a_i is an attribute of A, a_i ≠ a_j for i ≠ j, and v is a value from the value set of the attribute a_i. Assuming A = {a_1, ..., a_m}, any template T = (a_{i_1} = v_{i_1}) ∧ ... ∧ (a_{i_k} = v_{i_k}) can be represented by a sequence [x_1, ..., x_m], where x_p is v_p if p ∈ {i_1, ..., i_k} and "*" (the "don't care" symbol) otherwise. An object x satisfies the descriptor a = v if a(x) = v. An object x satisfies (matches) a template if it satisfies all the descriptors of the template. For any template T, by length(T) we denote the number of different descriptors a = v occurring in T, and by fitness_A(T) its fitness, i.e. the number of objects from the universe U satisfying T. If T consists of one descriptor a = v only, we also write n_A(a, v) (or n(a, v)) instead of fitness_A(T). The quality of a template T can be taken to be the number fitness_A(T) · length(T). If s is an integer, then we denote by Template_A(s) the set of all templates of A with fitness not less than s.
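
A minimal Python sketch of these template notions (an illustration, with a template encoded as a dictionary of its descriptors; the "*" positions are simply absent):

def matches(obj, template):
    """An object satisfies a template iff it satisfies every descriptor a = v."""
    return all(obj[a] == v for a, v in template.items())

def fitness(objects, template):
    """fitness_A(T): the number of objects of the universe satisfying T."""
    return sum(matches(u, template) for u in objects)

def quality(objects, template):
    """The quality measure used above: fitness_A(T) * length(T)."""
    return fitness(objects, template) * len(template)

For instance, on Table 1 the template a1 = 5 ∧ a3 = 0 ∧ a5 = black is matched by x1 and x4, so it has length 3, fitness 2 and quality 6 (one template with the parameters given in the caption of Table 1).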


4. Symbolic value partition problem

We have considered the real value attribute discretization problem. It is the problem of searching for a partition of real values into intervals (the natural linear order "<" on the real line ℝ is assumed).

In the case of symbolic value attributes (i.e. without any pre-assumed order on the value sets of attributes), the problem of searching for partitions of value sets into a "small" number of subsets is more complicated than for continuous attributes. Once again we apply the Boolean reasoning approach to construct a partition of symbolic value sets into a small number of subsets.

Let A = (U, A ∪ {d}) be a decision table, where A = {a_i : U → V_{a_i} = {v^{a_i}_1, ..., v^{a_i}_{n_i}}} for i ∈ {1, ..., k}. Any function P_{a_i} : V_{a_i} → {1, ..., m_i} (where m_i ≤ n_i) is called a partition of V_{a_i}. The rank of P_{a_i} is the value rank(P_{a_i}) = |P_{a_i}(V_{a_i})|. The function P_{a_i} defines a new partition attribute b_i = P_{a_i} ∘ a_i, i.e. b_i(u) = P_{a_i}(a_i(u)) for any object u ∈ U.

Let B ⊆ A be an arbitrary subset of attributes. The family of partitions {P_a}_{a ∈ B} is called B-consistent iff

∀ u, v ∈ U: [d(u) ≠ d(v) ∧ (u, v) ∉ IND(B)] ⇒ ∃ a ∈ B [P_a(a(u)) ≠ P_a(a(v))]     (2)

It means that if two objects u, v are discerned by B and d, then they are discerned by the attribute partitions defined by {P_a}_{a ∈ B}. We consider the following optimization problem, called the symbolic value partition problem:

Symbolic Value Partition Problem: For a given decision table A = (U, A ∪ {d}) and a set of nominal attributes B ⊆ A, search for a minimal B-consistent family of partitions, i.e. a B-consistent family {P_a}_{a ∈ B} with the minimal value of Σ_{a ∈ B} rank(P_a).

This concept is useful when we want to reduce attribute domains with large cardinalities. The discretization problem can be derived from the partition problem by adding the monotonicity condition for the family {P_a}_{a ∈ A}:

∀ v_1, v_2 ∈ V_a: v_1 ≤ v_2 ⇒ P_a(v_1) ≤ P_a(v_2)

We discuss three approaches to the solution of this problem, namely the local partition method, the global partition method and the "divide and conquer" method. The first approach is based on grouping the values of each attribute independently, whereas the second groups attribute values simultaneously for all attributes. The third method is similar to decision tree techniques: the original data table is divided into two subtables by selecting the "best binary partition of some attribute domain", and this process is continued on all subtables until some stopping criterion is satisfied.

4.1. Local partition

The local partition strategy is very simple. For any fixed attribute a ∈ A, we search for a partition P_a that preserves the consistency condition (2) for the attribute a alone (i.e. B = {a}). For any partition P_a, the equivalence relation ≈_{P_a} is defined by: v_1 ≈_{P_a} v_2 ⇔ P_a(v_1) = P_a(v_2) for all v_1, v_2 ∈ V_a. We consider the relation UNI_a defined on V_a as follows:

v_1 UNI_a v_2 ⇔ ∀ u, u' ∈ U: (a(u) = v_1 ∧ a(u') = v_2) ⇒ d(u) = d(u')     (3)

It is obvious that the relation UNI_a defined by (3) is an equivalence relation. Local partitions are created using the following

Proposition 4.1. If P_a is a-consistent then ≈_{P_a} ⊆ UNI_a. The equivalence relation UNI_a defines a minimal a-consistent partition of V_a.
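
The local partition of a single attribute can be computed directly from the definition of UNI_a. A minimal Python sketch (illustrative only, using the list-of-dictionaries table representation assumed earlier):

def local_partition(objects, a, d="d"):
    """Partition of V_a induced by UNI_a: two values are merged iff every object
    carrying either value has the same (unique) decision.  Returns a dict
    mapping each value to its group number."""
    decisions = {}
    for u in objects:
        decisions.setdefault(u[a], set()).add(u[d])

    groups, reps = {}, []              # reps: one (value, decision set) per group
    for v, ds in decisions.items():
        for gid, (w, dsw) in enumerate(reps):
            # v UNI_a w holds iff neither value mixes decisions and both carry the same one
            if len(ds) == 1 and len(dsw) == 1 and ds == dsw:
                groups[v] = gid
                break
        else:
            reps.append((v, ds))
            groups[v] = len(reps) - 1
    return groups

The rank of the resulting partition is the number of distinct group numbers; by Proposition 4.1 this is the smallest rank of any a-consistent partition.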


4.2. Global partition

We consider the discernibility matrix M(A) = [m_{i,j}]_{i,j=1}^{n} (see [17]) of the decision table A, where m_{i,j} = {a ∈ A : a(u_i) ≠ a(u_j)} is the set of attributes discerning the two objects u_i, u_j. Observe that if we want to discern an object u_i from another object u_j, we need to preserve one of the attributes in m_{i,j}. To put it more precisely: for any two objects u_i, u_j there must exist an attribute a ∈ m_{i,j} such that the values a(u_i), a(u_j) are discerned by P_a.

Hence, instead of cuts as in the case of continuous values (defined by pairs (a_i, c_j)), we can discern objects by triples (a_i, v^{a_i}_{i_1}, v^{a_i}_{i_2}), called chains, where a_i ∈ A for i = 1, ..., k and i_1, i_2 ∈ {1, ..., n_i}.

We can build a new decision table A+ = (U+, A+ ∪ {d+}) (analogously to the table A*; see Section 3.1.3), assuming U+ = U*, d+ = d* and A+ = {(a, v_1, v_2) : (a ∈ A) ∧ (v_1, v_2 ∈ V_a)}. Again, e.g. the Johnson heuristic can be applied to A+ to search for a minimal set of chains discerning all pairs of objects in different decision classes.

It can be seen that our problem can then be solved by efficient heuristics for graph coloring. The "graph k-colorability" problem is formulated as follows:

input: a graph G = (V, E) and a positive integer k ≤ |V|;
output: 1 if G is k-colorable (i.e. if there exists a function f : V → {1, ..., k} such that f(v) ≠ f(v') whenever (v, v') ∈ E), and 0 otherwise.

This problem is solvable in polynomial time for k = 2, but it is NP-complete for all k ≥ 3. However, similarly to discretization, efficient heuristics searching for (sub)optimal graph colorings, and thereby determining (sub)optimal partitions of attribute value sets, can be applied.

For any attribute a_i occurring in the semi-minimal set X of chains returned by the above heuristic, we construct a graph Γ_{a_i} = ⟨V_{a_i}, E_{a_i}⟩, where E_{a_i} is the set of all chains of the attribute a_i in X. Any coloring of all the graphs Γ_{a_i} defines an A-consistent partition of the value sets. Hence a heuristic searching for a minimal graph coloring also returns sub-optimal partitions of the attribute value sets. The corresponding Boolean formula has O(knl²) variables and O(n²) clauses, where l is the maximal value of card(V_a) for a ∈ A. Once the prime implicants of the Boolean formula have been constructed, a heuristic for graph coloring is applied to generate the new features.

4.3. Example

Let us consider the decision table presented in Figure 1 and a reduced form of its discernibility matrix.

First, from the Boolean function f_A with Boolean variables of the form a_{v1,v2} (corresponding to the chain (a, v_1, v_2) described above) we find a shortest prime implicant:

a_{a1,a2} ∧ a_{a2,a3} ∧ a_{a1,a4} ∧ a_{a3,a4} ∧ b_{b1,b4} ∧ b_{b2,b4} ∧ b_{b2,b3} ∧ b_{b1,b3} ∧ b_{b3,b5}

which can be represented by graphs (Figure 1). Next we apply a heuristic to color the vertices of those graphs, as shown in Figure 1. The colors correspond to the partitions:

P_a(a1) = P_a(a3) = 1;  P_a(a2) = P_a(a4) = 2
P_b(b1) = P_b(b2) = P_b(b5) = 1;  P_b(b3) = P_b(b4) = 2

and at the same time one can construct the new decision table (see Figure 1).

The following set of decision rules can be derived from the table A_P:

if a(u) ∈ {a1, a3} and b(u) ∈ {b1, b2, b5} then d = 0   (supported by u1, u2, u4)
if a(u) ∈ {a2, a4} and b(u) ∈ {b3, b4} then d = 0   (supported by u3)
if a(u) ∈ {a1, a3} and b(u) ∈ {b3, b4} then d = 1   (supported by u5, u9)
if a(u) ∈ {a2, a4} and b(u) ∈ {b1, b2, b5} then d = 1   (supported by u6, u7, u8, u10)
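
The coloring step can be carried out by any graph coloring heuristic; the following is a minimal Python sketch of a first-fit greedy coloring (a standard heuristic chosen here for illustration; the paper does not fix a particular one):

def greedy_coloring(vertices, edges):
    """First-fit coloring of a value graph Gamma_a = (V_a, E_a); the edges are the
    chains of the attribute kept in the prime implicant.  Returns value -> color,
    i.e. a partition P_a of V_a."""
    adjacency = {v: set() for v in vertices}
    for v, w in edges:
        adjacency[v].add(w)
        adjacency[w].add(v)

    color = {}
    # color vertices of high degree first (a common tie-breaking heuristic)
    for v in sorted(vertices, key=lambda x: -len(adjacency[x])):
        used = {color[w] for w in adjacency[v] if w in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

For attribute b of the example, greedy_coloring(["b1", "b2", "b3", "b4", "b5"], [("b1", "b4"), ("b2", "b4"), ("b2", "b3"), ("b1", "b3"), ("b3", "b5")]) yields the two groups {b3, b4} and {b1, b2, b5}, i.e. the partition P_b shown in Figure 1; the analogous call for attribute a gives the groups {a1, a3} and {a2, a4}.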


A      a    b    d
u1     a1   b1   0
u2     a1   b2   0
u3     a2   b3   0
u4     a3   b1   0
u5     a1   b4   1
u6     a2   b2   1
u7     a2   b1   1
u8     a4   b2   1
u9     a3   b4   1
u10    a2   b5   1

M(A)   u1                     u2                     u3                     u4
u5     b_{b1,b4}              b_{b2,b4}              a_{a1,a2}, b_{b3,b4}   a_{a1,a3}, b_{b1,b4}
u6     a_{a1,a2}, b_{b1,b2}   a_{a1,a2}              b_{b2,b3}              a_{a2,a3}, b_{b1,b2}
u7     a_{a1,a2}              a_{a1,a2}, b_{b1,b2}   b_{b1,b3}              a_{a2,a3}
u8     a_{a1,a4}, b_{b1,b2}   a_{a1,a4}              a_{a2,a4}, b_{b2,b3}   a_{a3,a4}, b_{b1,b2}
u9     a_{a1,a3}, b_{b1,b4}   a_{a1,a3}, b_{b2,b4}   a_{a2,a3}, b_{b3,b4}   b_{b1,b4}
u10    a_{a1,a2}, b_{b1,b5}   a_{a1,a2}, b_{b2,b5}   b_{b3,b5}              a_{a2,a3}, b_{b1,b5}

[Attribute value graphs for a and b, whose edges are the chains of the prime implicant a_{a1,a2} ∧ a_{a2,a3} ∧ a_{a1,a4} ∧ a_{a3,a4} ∧ b_{b1,b4} ∧ b_{b2,b4} ∧ b_{b2,b3} ∧ b_{b1,b3} ∧ b_{b3,b5}, together with a 2-coloring of their vertices; graphics not reproduced.]

The colorings correspond to the partitions P_a(a1) = P_a(a3) = 1, P_a(a2) = P_a(a4) = 2 and P_b(b1) = P_b(b2) = P_b(b5) = 1, P_b(b3) = P_b(b4) = 2, and to the reduced table:

a^{Pa}   b^{Pb}   d
1        1        0
2        2        0
1        2        1
2        1        1

Figure 1. The decision table and the corresponding discernibility matrix; coloring of the attribute value graphs and the reduced table.

4.4. Divide and conquer method

For a fixed attribute a and an object set Z ⊆ U, we define the discernibility degree of two disjoint sets V_1, V_2 of values from V_a (denoted by Disc_a(V_1, V_2 | Z)) by

Disc_a(V_1, V_2 | Z) = |{(u_1, u_2) ∈ Z² : d(u_1) ≠ d(u_2) ∧ (a(u_1), a(u_2)) ∈ V_1 × V_2}|

Let P be an arbitrary partition of V_a. For any two objects u_1, u_2 ∈ U we say that the pair (u_1, u_2) is discerned by P if [d(u_1) ≠ d(u_2)] ∧ [P(a(u_1)) ≠ P(a(u_2))]. The discernibility degree of a partition P over a set of objects Z ⊆ U is defined as the number of pairs of objects from Z discerned by P. Using the same notation Disc(P | Z), we have

Disc_a(P | Z) = |{(u_1, u_2) ∈ Z² : d(u_1) ≠ d(u_2) ∧ P(a(u_1)) ≠ P(a(u_2))}|     (4)

In this section, similarly to the decision tree approach, we consider the optimization problem called the binary optimal partition (BOP) problem, which can be described as follows:


BOP Problem: Given a set of objects Z and an attribute a, find a binary partition P of V_a (rank(P) = 2) such that Disc_a(P | Z) is maximal.

We will show that the BOP problem is NP-hard with respect to the size of V_a. The proof will suggest some natural heuristics for searching for an optimal partition. We apply these heuristics to construct binary decision trees from symbolic attributes.

4.4.1. Complexity of the binary partition problem

In this section we consider the problem of searching for an optimal binary partition of the domain of a fixed attribute a. Let us denote by s(V, λ | Z) the number of occurrences of values from V among the values a(u) for u ∈ Z ∩ C_λ, i.e.

s(V, λ | Z) = |{u ∈ Z : (a(u) ∈ V) ∧ (d(u) = λ)}|

In the case of rank(P) = 2 (without loss of generality one can assume that P : V_a → {1, 2}) and V_d = {0, 1}, the discernibility degree of P is expressed by

Disc(P | Z) = Σ_{i ∈ V_1, j ∈ V_2} [s(i, 0 | Z) · s(j, 1 | Z) + s(i, 1 | Z) · s(j, 0 | Z)]     (5)

where V_1 = P^{-1}(1), V_2 = P^{-1}(2) and s(v, λ | Z) = s({v}, λ | Z). In this section the set of objects Z is fixed and, for simplicity of notation, Z will be omitted from the described functions in the sequel.
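
Formula (5) reduces the evaluation of a candidate binary partition to the per-value class counts. A minimal Python sketch (an illustration with hypothetical argument names):

def disc_binary(values, s0, s1, V1):
    """Disc(P) of the binary partition {V1, V2} of `values`, computed by formula (5)
    from the per-value class counts s0, s1 (dicts mapping a value to its number of
    occurrences in decision class 0 and 1, respectively)."""
    V2 = [v for v in values if v not in V1]
    return sum(s0[i] * s1[j] + s1[i] * s0[j] for i in V1 for j in V2)

With s0 = s1 = {v: 1 for v in "abcd"}, every balanced split such as V1 = {"a", "b"} attains the value 8 = (1/2)(Σ n_i)², which is exactly the correspondence used in the NP-completeness proof below.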


To prove the NP-hardness of the BOP problem we consider the corresponding decision problem, called the binary partition problem, which is defined as follows:

Binary Partition (BP) Problem:
Input: a value set V = {v_1, ..., v_n}, two functions s_0, s_1 : V → ℕ and a positive integer K.
Question: is there a binary partition of V into two disjoint subsets P(V) = {V_1, V_2} such that the discernibility degree of P, defined by

Disc(P) = Σ_{i ∈ V_1, j ∈ V_2} [s_0(i) · s_1(j) + s_1(i) · s_0(j)],

satisfies the inequality Disc(P) ≥ K?

Obviously, if the BP problem is NP-complete, then the BOP problem is NP-hard.

Theorem 4.1. The binary partition problem is NP-complete.

Proof:
It is easy to see that the BP problem is in NP. The NP-completeness of the BP problem can be shown by a polynomial transformation from the Set Partition Problem (SPP), which is defined as the problem of checking, for a given finite set of positive integers S = {n_1, n_2, ..., n_k}, whether there is a partition of S into two disjoint subsets S_1 and S_2 such that Σ_{i ∈ S_1} i = Σ_{j ∈ S_2} j.

It is known that the SPP is NP-complete [8]. We will show that SPP is polynomially transformable to BP. Let S = {n_1, n_2, ..., n_k} be an instance of SPP. The corresponding instance of the BP problem is:

- V = {1, 2, ..., k};
- s_0(i) = s_1(i) = n_i for i = 1, ..., k;
- K = (1/2) (Σ_{i=1}^{k} n_i)².

One can see that for any partition P of the set V into two disjoint subsets V_1 and V_2 the discernibility degree of P can be expressed by

Disc(P) = Σ_{i ∈ V_1, j ∈ V_2} [s_0(i) · s_1(j) + s_1(i) · s_0(j)] = Σ_{i ∈ V_1, j ∈ V_2} 2 n_i n_j = 2 · (Σ_{i ∈ V_1} n_i) · (Σ_{j ∈ V_2} n_j) ≤ (1/2) (Σ_{i ∈ V_1} n_i + Σ_{j ∈ V_2} n_j)² = (1/2) (Σ_{i ∈ V} n_i)² = K,

i.e. for any partition P we have the inequality Disc(P) ≤ K, and the equality holds iff Σ_{i ∈ V_1} n_i = Σ_{j ∈ V_2} n_j. Hence P is a good partition of V (into V_1 and V_2) for the BP problem iff it defines a good partition of S (into S_1 = {n_i}_{i ∈ V_1} and S_2 = {n_j}_{j ∈ V_2}) for the SPP problem. Therefore the BP problem is NP-complete and the BOP problem is NP-hard. □

Now we describe several approximate solutions of the BOP problem. These heuristics are quite similar to 2-means clustering algorithms, but we use the discernibility degree function instead of the Euclidean measure. First of all, let us explore several properties of the Disc function.

The function Disc(V_1, V_2) can be computed quickly from the function s, namely

Disc(V_1, V_2) = Σ_{λ_1 ≠ λ_2} s(V_1, λ_1) · s(V_2, λ_2)

It is easy to observe that s(V, λ) = Σ_{v ∈ V} s(v, λ); hence the function Disc is additive and symmetric, i.e.

Disc(V_1 ∪ V_2, V_3) = Disc(V_1, V_3) + Disc(V_2, V_3)
Disc(V_1, V_2) = Disc(V_2, V_1)

for arbitrary value sets V_1, V_2, V_3.

Let V_a = {v_1, ..., v_m} be the domain of an attribute a. The grouping algorithm by minimizing discernibility of values from V_a starts with the most detailed partition P_a = {{v_1}, ..., {v_m}}, where with each V ∈ P_a a vector s(V) = [s(V, 0), s(V, 1), ..., s(V, r)] of its occurrences in the decision classes C_0, C_1, ..., C_r is associated. In every step we look for the two nearest sets V_1, V_2 in P_a with respect to the function Disc(V_1, V_2) and replace them by the new set V = V_1 ∪ V_2 with the occurrence vector s(V) = s(V_1) + s(V_2). The algorithm runs until P_a contains two sets; the suboptimal binary partition is then stored in P_a.
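
A minimal Python sketch of grouping by minimizing discernibility (illustrative; the input maps every value to its occurrence vector over the decision classes, a representation assumed here):

from itertools import combinations

def group_by_min_disc(value_vectors):
    """Agglomerative grouping: repeatedly merge the two groups with the smallest
    Disc until only two groups remain.  `value_vectors` maps a value v to the
    list s(v) = [s(v, 0), ..., s(v, r)]."""
    def disc(s1, s2):
        # Disc(V1, V2) = sum over pairs of different decision classes of s(V1, l1) * s(V2, l2)
        return sum(s1[i] * s2[j]
                   for i in range(len(s1)) for j in range(len(s2)) if i != j)

    groups = [([v], list(s)) for v, s in value_vectors.items()]
    while len(groups) > 2:
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: disc(groups[ij[0]][1], groups[ij[1]][1]))
        (vi, si), (vj, sj) = groups[i], groups[j]
        merged = (vi + vj, [x + y for x, y in zip(si, sj)])   # s(V1 u V2) = s(V1) + s(V2)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return [g[0] for g in groups]

The additivity of s is what makes the merge step cheap: the occurrence vector of a merged group is simply the componentwise sum of the vectors of its parts.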


The second technique is called grouping by maximizing discernibility. The algorithm also starts with the family of singletons P_a = {{v_1}, ..., {v_m}}, but first we look for the two singletons with the largest discernibility degree, which become the kernels of the two groups; let us denote them by V_1 = {v_1} and V_2 = {v_2}. For any symbolic value v_i ∉ V_1 ∪ V_2 we compare the discernibility degrees Disc({v_i}, V_1) and Disc({v_i}, V_2) and attach v_i to the group with the smaller discernibility degree. This process ends when all the values of V_a have been drawn out.
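
A minimal Python sketch of grouping by maximizing discernibility (illustrative; it updates the occurrence vector of a group as values are attached, which is one possible reading of the procedure):

def group_by_max_disc(value_vectors):
    """Pick the two values with the largest mutual Disc as kernels, then attach
    every remaining value to the kernel group with the smaller discernibility
    degree towards it."""
    def disc(s1, s2):
        return sum(s1[i] * s2[j]
                   for i in range(len(s1)) for j in range(len(s2)) if i != j)

    values = list(value_vectors)
    k1, k2 = max(((v, w) for v in values for w in values if v != w),
                 key=lambda vw: disc(value_vectors[vw[0]], value_vectors[vw[1]]))
    group1, s1 = [k1], list(value_vectors[k1])
    group2, s2 = [k2], list(value_vectors[k2])
    for v in values:
        if v in (k1, k2):
            continue
        sv = value_vectors[v]
        if disc(sv, s1) <= disc(sv, s2):          # attach to the less conflicting group
            group1.append(v)
            s1 = [x + y for x, y in zip(s1, sv)]
        else:
            group2.append(v)
            s2 = [x + y for x, y in zip(s2, sv)]
    return group1, group2

Both heuristics return a binary partition of V_a and can therefore be plugged directly into the decision tree construction described next.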


4.4.2. Decision tree construction

In [13] we have presented a very efficient method for decision tree construction called the MD algorithm. It was developed for continuous (numeric) data only. In this section we extend the MD algorithm to data with mixed attributes, i.e. numeric and symbolic attributes, by using the binary partition algorithms presented in the previous sections.

Given a decision table A = (U, A ∪ {d}), we assume that there are two types of attributes: numeric and symbolic. We also assume that the type of each attribute is given by a predefined function type : A → {N, S}, where type(a) = N if a is a numeric attribute and type(a) = S if a is a symbolic attribute.

We can use the grouping methods to generate a decision tree. The structure of the decision tree is defined as follows: internal nodes of the tree are labelled by logical test functions of the form T : U → {True, False}, and external nodes (leaves) are labelled by decision values.

In this paper we consider two kinds of tests, related to the attribute types. In the case of a symbolic attribute a_j ∈ A we use test functions defined by a partition:

T(u) = True ⇔ a_j(u) ∈ V

where V ⊆ V_{a_j}. For a numeric attribute a_i ∈ A we use test functions defined by discretization:

T(u) = True ⇔ a_i(u) ≤ c ⇔ a_i(u) ∈ (−∞, c]

where c is a cut in V_{a_i}. Below we present the process of decision tree construction. During the construction we additionally use object sets to label the nodes of the decision tree; this third kind of label is removed at the end of the construction process. The algorithm is presented in Figure 3.

[Figure 2. The decision tree: a node labelled by an object set Z is split by the best partition P_a = {V_1, V_2} over Z, with the branch a(u) ∈ V_1 leading to the successor L_1 and the branch a(u) ∈ V_2 leading to the successor L_2; graphics not reproduced.]

ALGORITHM

1. Create a decision tree with one node labelled by the object set U.
2. For each leaf L labelled by an object set Z do begin
3.   For each attribute a ∈ A do:
       if type(a) = S then search for the optimal binary partition P_a = {V_1, V_2} of V_a;
       if type(a) = N then search for the optimal cut c in V_a and set P_a = {V_1, V_2}, where V_1 = (−∞, c] and V_2 = (c, ∞).
4.   Choose an attribute a for which Disc(P_a | Z) is maximal and label the current node by the formula [a ∈ V_1].
5.   For i := 1 to 2 do:
       Z_i := {u ∈ Z : a(u) ∈ V_i};
       L_i := v_d if d(Z_i) = {v_d}, and L_i := Z_i if card(d(Z_i)) ≥ 2.
6.   Create two successors of the current node and label them by L_1 and L_2.
   end;
7. If all leaves are labelled by decisions then STOP, else go to Step 2.

Figure 3. The decision tree construction algorithm.

4.5. Patterns for incomplete data

Now we consider data tables with incomplete value attributes. The problem is how to guess the unknown values in a data table so as to guarantee maximal discernibility of objects in different decision classes.


The idea of grouping values proposed in the previous sections can be used to solve this problem. We have shown how to extract patterns from data by using the discernibility of objects in different decision classes. At the same time, the information about the values in one group can be used to guess the unknown values. Below we define the searching problem for unknown values in an incomplete decision table.

The decision table A = (U, A ∪ {d}) is called incomplete if the attributes in A are defined as functions a : U → V_a ∪ {*}, where a(u) = * means that the value of the attribute a is unknown for the object u. All values different from * are called fixed values.

We say that a pair of objects x, y ∈ U is inconsistent if d(x) ≠ d(y) ∧ ∀ a ∈ A [a(x) = * ∨ a(y) = * ∨ a(x) = a(y)]. We denote by Conflict(A) the number of inconsistent pairs of objects in the decision table A.

The problem is to search for fixed values which can be substituted for the * entries in the table A in such a way that the number of conflicts Conflict(A') in the new table A' (obtained by changing the * entries of A into fixed values) is minimal.

The main idea is to group the values in the table A so that the discernibility of objects in different decision classes is maximized. Then we replace * by a value depending on the fixed values belonging to the same group.

To group attribute values we can use the heuristics proposed in the previous sections. We assume that all the unknown values of attributes in A are pairwise different and different from the fixed values; hence we can label the unknown values by different indices before applying the algorithms proposed in the previous sections. This assumption allows us to create the discernibility matrix for an incomplete table as in the case of complete tables, and we can then use the global partition method presented in Section 4.2 for grouping unknown values.

The function Disc(V_1, V_2) can also be computed for all pairs of subsets which may contain unknown values. Hence we can apply both heuristics of the divide and conquer method for grouping unknown values.

After the grouping step, we assign to an unknown value one (or all) of the fixed values in the group which contains it. If there is no fixed value in the group, we choose an arbitrary value (or all possible values) from the attribute domain that does not belong to other groups. If such values do not exist either, we can say that these unknown values have no influence on discernibility in the decision table, and we can assign to them an arbitrary value from the domain.
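
The quality of a completed table can be checked directly against the definition of Conflict(A). A minimal Python sketch (illustrative; it uses "*" for the unknown value and the list-of-dictionaries representation assumed earlier):

from itertools import combinations

def conflicts(objects, attrs, d="d", missing="*"):
    """Conflict(A): the number of inconsistent pairs of objects, i.e. pairs with
    different decisions that no attribute can discern because the corresponding
    values are equal or unknown."""
    def indiscernible(u, v):
        return all(u[a] == missing or v[a] == missing or u[a] == v[a] for a in attrs)

    return sum(1 for u, v in combinations(objects, 2)
               if u[d] != v[d] and indiscernible(u, v))

Candidate substitutions for the * entries can then be compared by the value of conflicts on the resulting tables, the smallest value being preferred.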


5. Experimental results

Experiments with the classification methods have been carried out on decision tables using the two techniques called "train-and-test" and "n-fold cross-validation". In Table 2 we present some experimental results obtained by testing the proposed methods for classification quality on well-known data tables from the UC Irvine repository, together with execution times. Similar results obtained by alternative methods are reported in [7]; it is interesting to compare those results with regard to both classification quality and execution time.

Names of        Classification accuracies
tables          S-ID3    C4.5     MD       MD-G
Australian      78.26    85.36    83.69    84.49
Breast (L)      62.07    71.00    69.95    69.95
Diabetes        66.23    70.84    71.09    76.17
Glass           62.79    65.89    66.41    69.79
Heart           77.78    77.04    77.04    81.11
Iris            96.67    94.67    95.33    96.67
Lympho          73.33    77.01    71.93    82.02
Monk-1          81.25    75.70    100      93.05
Monk-2          69.91    65.00    99.07    99.07
Monk-3          90.28    97.20    93.51    94.00
Soybean         100      95.56    100      100
TicTacToe       84.38    84.02    97.70    97.70
Average         78.58    79.94    85.48    87.00

Table 2. The quality comparison between decision tree methods. MD: MD-heuristic; MD-G: MD-heuristic with symbolic value partition.

6. Conclusions

We have presented symbolic (nominal) value partition problems as alternatives to discretization problems for numeric (continuous) data. We have discussed their computational complexity and proposed a number of approximate solutions. The proposed solutions are obtained using well-known existing approaches such as Boolean reasoning, entropy and decision trees. Some of these methods can be used together with discretization methods for data containing both symbolic and numeric attributes. In our system for data analysis we have implemented efficient algorithms based on the methods discussed. The tests show that they are very efficient from the point of view of time complexity. They also ensure a high quality of recognition of new, unseen cases. The proposed heuristics for symbolic value partition allow us to obtain a more compressed form of the decision algorithm; hence, by the minimum description length principle, we can expect them to return decision algorithms with a high quality of classification of unseen objects.

Acknowledgment: This work was supported by the Polish State Committee for Scientific Research, grant #8T11C01011, and by the Research Program of the European Union ESPRIT-CRIT2 No. 20288.


References

[1] Brown, F.M. (1990). Boolean Reasoning. Kluwer, Dordrecht.
[2] Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. In: Y. Kodratoff (ed.), Machine Learning - EWSL-91, Porto, Portugal, LNAI, pp. 164-178.
[3] Chmielewski, M.R., Grzymala-Busse, J.W. (1994). Global discretization of attributes as preprocessing for machine learning. Proc. of the III International Workshop on RSSC'94, pp. 294-301.
[4] Dougherty, J., Kohavi, R., Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proc. of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 194-202.
[5] Fayyad, U.M., Irani, K.B. (1992). The attribute selection problem in decision tree generation. Proc. of AAAI-92, San Jose, CA, MIT Press, pp. 104-110.
[6] Fayyad, U.M., Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proc. of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 1022-1027.
[7] Friedman, J., Kohavi, R., Yun, Y. (1996). Lazy decision trees. In: Proc. of AAAI-96, pp. 717-724.
[8] Garey, M.R., Johnson, D.S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, San Francisco.
[9] Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. Proc. of the Tenth National Conference on Artificial Intelligence, MIT Press, pp. 123-128.
[10] Kodratoff, Y., Michalski, R. (1990). Machine Learning: An Artificial Intelligence Approach, vol. 3. Morgan Kaufmann.
[11] Nguyen, H.S., Skowron, A. (1995). Quantization of real value attributes: rough set and Boolean reasoning approaches. Proc. of the Second Joint Annual Conference on Information Sciences, Wrightsville Beach, NC, pp. 34-37.
[12] Nguyen, S.H., Nguyen, H.S. (1996). Some efficient algorithms for rough set methods. Proc. of the Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'96), Granada, Spain, pp. 1451-1456.
[13] Nguyen, H.S., Nguyen, S.H. (1997). Discretization methods for data mining. In: A. Skowron and L. Polkowski (eds.), Rough Sets in Data Mining and Knowledge Discovery. Springer-Verlag, Berlin.
[14] Nguyen, S.H., Skowron, A., Synak, P. (1997). Discovery of data patterns with applications to decomposition and classification problems. In: A. Skowron and L. Polkowski (eds.), Rough Sets in Data Mining and Knowledge Discovery. Springer-Verlag, Berlin.
[15] Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht.
[16] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[17] Skowron, A., Rauszer, C. (1992). The discernibility matrices and functions in information systems. In: R. Słowiński (ed.), Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory. Kluwer, Dordrecht, pp. 331-362.