Constructing a Fuzzy Decision Tree by Integrating Fuzzy Sets and Entropy

TIEN-CHIN WANG (王天津), HSIEN-DA LEE (李賢達)
Department of Information Management, I-Shou University, Kaohsiung, Taiwan
Abstract: - Decision tree induction is one of the most common approaches for extracting knowledge from sets of feature-based examples. In the real world, much data occurs in fuzzy and uncertain forms, and a decision tree must be able to deal with such fuzzy data. This paper presents a tree construction procedure that builds a fuzzy decision tree from a collection of fuzzy data by integrating fuzzy set theory and entropy. It proposes a fuzzy decision tree induction method for fuzzy data whose numeric attributes can be represented by fuzzy numbers, interval values, or crisp values, whose nominal attributes are represented by crisp nominal values, and whose classes carry confidence factors. It also presents experimental results that show the applicability of the proposed method.
Key-Words: Fuzzy Decision Tree, Fuzzy Sets, Entropy, Information Gain, Classification, Data Mining

1 Introduction
Decision trees have been widely and successfully used in machine learning. More recently, fuzzy representations have been combined with decision trees. Many methods have been proposed to construct decision trees from collections of data. Due to observation error, uncertainty, and so on, much of the data collected in the real world is obtained in fuzzy form. Fuzzy decision trees treat features as fuzzy variables and also yield simple decision trees. Moreover, the use of fuzzy sets is expected to deal with uncertainty due to noise and imprecision. Research on fuzzy decision tree induction for fuzzy data has not yet been sufficiently developed. This paper is concerned with a fuzzy decision tree induction method for such fuzzy data. It proposes a tree-building procedure to construct a fuzzy decision tree from a collection of fuzzy data.
Decision trees and decision rules are data-mining methodologies applied in many real-world applications as a powerful solution to classification problems [1]. Classification is the process of learning a function that maps a data item into one of several predefined classes. Every classifier based on inductive-learning algorithms is given as input a set of samples that consist of vectors of attribute values and a corresponding class. For example, a simple classification might group students into three groups based on their scores: (1) those students whose scores are above 90, (2) those whose scores are between 70 and 90, and (3) those whose scores are below 70.
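As a minimal illustrative sketch of this crisp grouping (the thresholds 70 and 90 come from the example above; the function name and return values are our own):

```python
def score_group(score: float) -> int:
    """Crisp grouping by the thresholds from the example (70 and 90)."""
    if score > 90:
        return 1        # scores above 90
    if score >= 70:
        return 2        # scores between 70 and 90
    return 3            # scores below 70

print(score_group(85))  # -> 2
```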
1.1 Fuzzy set theory
Fuzzy set theory was first proposed by Zadeh to represent and manipulate data and information that possess non-statistical uncertainty. Fuzzy set theory is primarily concerned with quantifying and reasoning using natural language, in which words can have ambiguous meanings. It can be thought of as an extension of traditional crisp sets, in which each element must either be in or not in a set. Fuzzy sets are defined on a non-fuzzy universe of discourse, which is an ordinary set. A fuzzy set F of a universe of discourse U is characterized by a membership function $\mu_F(x)$ which assigns to every element $x \in U$ a membership degree $\mu_F(x) \in [0,1]$. An element $x \in U$ is said to be in a fuzzy set F if and only if $\mu_F(x) > 0$, and to be a full member if and only if $\mu_F(x) = 1$ [5]. Membership functions can either be chosen by the user arbitrarily, based on the user's experience, or they can be designed by using optimization procedures [6][7].
Typically, a fuzzy subset A can be represented as

$A = \{\mu_A(x_1)/x_1, \mu_A(x_2)/x_2, \ldots, \mu_A(x_n)/x_n\}$

where the separating symbol / is used to associate the membership value with its coordinate on the horizontal axis. For example, in Fig. 1, let F = "integers close to 10"; then one choice for $\mu_F(x)$ is expressed as

$F = 0.0/8 + 0.5/9 + 1.0/10 + 0.5/11 + 0.0/12$
Fig. 1. Triangular membership function for a number close to 10
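As a hedged sketch of this triangular membership function (the breakpoints 8, 10, and 12 are read off the example above; the helper name is ours):

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# "Integers close to 10": assumed peak at 10, feet at 8 and 12 (from Fig. 1).
mu_F = {x: triangular(x, 8, 10, 12) for x in range(8, 13)}
# mu_F == {8: 0.0, 9: 0.5, 10: 1.0, 11: 0.5, 12: 0.0},
# i.e. F = 0.0/8 + 0.5/9 + 1.0/10 + 0.5/11 + 0.0/12 as above.
```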
1.2 Fuzzy Decision Trees
A decision tree [4][8] is a formalism for expressing a mapping from attribute values to classes. It consists of tests or attribute nodes linked to two or more subtrees, and leaves or decision nodes labeled with a class that indicates the decision. The main advantage of the decision-tree approach is that it visualizes the solution; it is easy to follow any path through the tree. Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system. A decision tree model employs a recursive divide-and-conquer strategy to divide the data set into partitions so that all of the records in a partition have the same class label [9]. In classical decision trees, a datum follows down only one branch of a node, the branch whose condition it satisfies, and finally arrives at exactly one leaf node. In tree-structured
representations, a set of data is represented by a node, and the
entire data set is represented as a root node. When a split is
made, several child nodes, which correspond to partitioned data
subsets, are formed. If a node is not to be split any further, it
is called a leaf; otherwise, it is an internal node. Decision trees
classify data by sorting them down the tree from the root to leaf
nodes. ID3 and CART are typical decision tree induction algorithms [10][11]. Decision trees were popularized by Quinlan with the ID3 algorithm. Systems based on ID3 work well in symbolic domains, and a large variety of extensions to the basic ID3 algorithm have been developed by different researchers. ID3 is designed to deal with symbolic domain data, and each datum finally arrives at exactly one leaf node. The algorithm is applied recursively to each child node until all samples at a node belong to one class. Fuzzy decision trees, by contrast, allow data to follow down multiple branches of a node simultaneously, with satisfaction degrees in [0,1] [12]. CART is designed to deal with continuous numeric domain data. A number of variants of these algorithms have been developed; the fuzzy decision tree is one of them.
Fuzzy decision trees attempt to combine elements of symbolic and
sub-symbolic approaches. Fuzzy sets and fuzzy logic allow modeling
language-related uncertainties, while providing a symbolic
framework for knowledge comprehensibility. Fuzzy decision trees
differ from traditional crisp decision trees in three respects
[10]: (1) They use splitting criteria based on fuzzy restrictions.
(2) Their inference procedures are different. (3) The fuzzy sets
representing the data have to be defined.
Fuzzy decision tree induction has two major components: a procedure for fuzzy decision tree building and an inference procedure for decision making [13]. Applying an ID3-like procedure to fuzzy decision tree construction requires the following elements: attribute value space partitioning methods, a branching attribute selection method, a branching test method to decide to which degree data follow down the branches of a node, and leaf node labeling methods to determine the classes for which the leaf nodes stand.
1.3 Entropy Heuristics
Attribute selection in the ID3 and C4.5 algorithms is based on minimizing an information entropy measure applied to the examples at a node [1]. The entropy measure is used to calculate the information gain, which reflects the quality of an attribute as the branching attribute. The attribute-selection part of ID3 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information conveyed by the value of the given attribute. An information-based heuristic selects the attribute providing the highest information gain. A data set with some discrete-valued condition attributes and one discrete-valued decision attribute can be presented in the form of a knowledge representation system $J = (U, C \cup D)$, where $U = \{u_1, u_2, \ldots, u_s\}$ is the set of data samples, $C = \{c_1, c_2, \ldots, c_n\}$ is the set of condition attributes, and $D = \{d\}$ is the one-element set containing the decision attribute, or class label attribute. Suppose this class label attribute has m distinct values defining m distinct classes $d_i$ (for $i = 1, \ldots, m$), and let $s_i$ be the number of samples of U in class $d_i$. The
expected information, or entropy, needed to classify a given sample is then given by

$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$   (1)

where $p_i$ is the probability that an arbitrary sample belongs to class $d_i$, estimated by $s_i / s$ (s being the total number of samples). Let attribute $c_i$ have v distinct values $\{A_1, A_2, \ldots, A_v\}$; attribute $c_i$ can then be used to partition U into v subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ (for $j = 1, \ldots, v$) contains those samples in U that have value $A_j$ of $c_i$. Let $s_{ij}$ be the number of samples of class $d_i$ in a subset $S_j$; the entropy of attribute $c_i$ is given by

$E(c_i) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$   (2)

The term $(s_{1j} + \cdots + s_{mj})/s$ acts as the weight of the j-th subset: the number of samples in the subset divided by the total number of samples. The smaller the entropy value, the greater the purity of the subset partitions. Thus the attribute that leads to the largest information gain is selected as the branching attribute. For a given subset $S_j$, the information is expressed as

$I(s_{1j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$   (3)

where $p_{ij} = s_{ij} / |S_j|$ ($|S_j|$ being the number of samples in subset $S_j$) is the probability that a sample in $S_j$ belongs to class $d_i$. The information gain of attribute $c_i$ is then given by

$\mathrm{Gain}(c_i) = I(s_1, \ldots, s_m) - E(c_i)$   (4)

We compute the information gain of each condition attribute; the attribute with the highest information gain is the most informative and most discriminating attribute of the given set.
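As an illustrative sketch (not the authors' implementation), formulas (1)-(4) can be coded directly; samples are assumed to be encoded as (attribute dictionary, class label) pairs:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Formulas (1)/(3): I = -sum_i p_i * log2(p_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def attribute_entropy(samples, attr):
    """Formula (2): entropy E(c_i) of splitting on attr, weighted by subset size."""
    subsets = {}
    for features, label in samples:
        subsets.setdefault(features[attr], []).append(label)
    total = len(samples)
    return sum(len(lbls) / total * entropy(lbls) for lbls in subsets.values())

def gain(samples, attr):
    """Formula (4): Gain(c_i) = I(s_1, ..., s_m) - E(c_i)."""
    return entropy([label for _, label in samples]) - attribute_entropy(samples, attr)
```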
2 Experiment
In this section, an example is given to illustrate the proposed fuzzy decision tree algorithm. This example is intended to show that the fuzzy decision tree algorithm can be used to evaluate student admission for graduate school. The data set includes 10 applicants, as shown in Table 1.
Table 1. The data set of students

Student no.  GPA  ETS  WE         Ref.  Admission
1            3.2  75   Fair       Yes   Yes
2            2.8  52   Excellent  N/A   No
3            2.7  69   Fair       Yes   No
4            3.6  86   Excellent  Yes   Yes
5            2.1  63   Fair       Yes   No
6            2.6  91   Fair       N/A   Yes
7            2.8  63   Excellent  Yes   No
8            2.3  77   Fair       Yes   No
9            3.6  68   Fair       Yes   Yes
10           3.5  90   Fair       N/A   Yes
Each case consists of four condition attributes: grade point
average (denoted GPA), entrance test score (denoted ETS), working
experience (denoted WE), and reference (denoted Ref).
In this example, triangular membership functions are used to represent fuzzy sets because of their simplicity, easy comprehension, and computational efficiency. Membership functions are usually predefined by experienced experts; they can also be derived through automatic adjustment [14]. As shown in Fig. 2 and Fig. 3, the GPA and ETS attributes each have three fuzzy regions: Low, Middle, and High. Thus, three fuzzy membership values are produced for each score according to the predefined membership functions.
Fig. 2. The membership function for examinees’ GPAs
Fig. 3. The membership function for examinees’ scores
3 Problem Solution
For the experimental data in Table 1, the decision-tree construction algorithm proceeds as in the following subsections.

3.1 Calculate Information Gain
STEP 1. To represent a continuous fuzzy set, we need to express it as a function and then map the elements of the set to their degrees of membership [3]. Transform the quantitative values of each examinee's scores into fuzzy sets. Taking the Entrance Test Score (ETS) as an example, the score "85" can be converted into the fuzzy set (0.0/Low + 0.0/Middle + 0.5/High) using the predefined membership functions in Fig. 3. The transformation procedure is repeated for the other scores. The result is shown in Table 2.
Table 2. The data set of students in fuzzy form

no.  GPA     ETS     WE         Ref.  Admission
1    Middle  Middle  Fair       Yes   Yes
2    Middle  Low     Excellent  N/A   No
3    Middle  Middle  Fair       Yes   No
4    High    High    Excellent  Yes   Yes
5    Low     Low     Fair       Yes   No
6    Middle  High    Fair       N/A   Yes
7    Middle  Low     Excellent  Yes   No
8    Low     Middle  Fair       Yes   No
9    High    Middle  Fair       Yes   Yes
10   High    High    Fair       N/A   Yes
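A sketch of this fuzzification step, reusing the `triangular` helper from Section 1.1; since the exact breakpoints of Fig. 3 are not given numerically in the text, the region parameters below are assumptions for illustration only:

```python
# Assumed triangular regions for ETS scores; the true breakpoints are in Fig. 3.
ETS_REGIONS = {"Low": (0, 50, 70), "Middle": (50, 70, 90), "High": (70, 90, 110)}

def fuzzify(score, regions):
    """Map a crisp score to its membership degree in each fuzzy region."""
    return {name: triangular(score, a, b, c) for name, (a, b, c) in regions.items()}

def linguistic_label(score, regions):
    """Keep only the region with the highest membership, as in Table 2."""
    memberships = fuzzify(score, regions)
    return max(memberships, key=memberships.get)

print(linguistic_label(91, ETS_REGIONS))  # -> 'High' under the assumed regions
```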
STEP 2. Form a knowledge representation system $J = (U, C \cup D)$, where $U = \{1, \ldots, 10\}$, $C = \{GPA, ETS, WE, REF\}$, and $D = \{Admission\}$. The class label attribute Admission has two distinct values $\{yes, no\}$, so there are two distinct classes (m = 2). Let class $d_1$ represent yes and class $d_2$ represent no. There are 5 samples of class yes and 5 samples of class no, so by formula (1)

$I(s_1, s_2) = -\frac{5}{10}\log_2\frac{5}{10} - \frac{5}{10}\log_2\frac{5}{10} = 1$
STEP 3. Compute the entropy of each attribute. The attribute GPA has three distinct values {High, Middle, Low}, so U can be partitioned into three subsets $\{S_1, S_2, S_3\}$.

For GPA = "High": $s_{11} = 3$, $s_{21} = 0$, and by formula (3)
$I(s_{11}, s_{21}) = -\frac{3}{3}\log_2\frac{3}{3} - 0 = 0$

For GPA = "Middle": $s_{12} = 2$, $s_{22} = 3$, and by formula (3)
$I(s_{12}, s_{22}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$

For GPA = "Low": $s_{13} = 0$, $s_{23} = 2$, and by formula (3)
$I(s_{13}, s_{23}) = -0 - \frac{2}{2}\log_2\frac{2}{2} = 0$

By formula (2),
$E(GPA) = \frac{3}{10} I(s_{11}, s_{21}) + \frac{5}{10} I(s_{12}, s_{22}) + \frac{2}{10} I(s_{13}, s_{23}) = 0.485$

and by formula (4),
$\mathrm{Gain}(GPA) = I(s_1, s_2) - E(GPA) = 0.514$
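Using the `gain` helper sketched in Section 1.3 on the fuzzified data of Table 2, the hand calculation can be checked; the encoding of the table as a Python list is ours:

```python
# Fuzzified samples from Table 2, encoded as (attributes, Admission) pairs.
data = [
    ({"GPA": "Middle", "ETS": "Middle", "WE": "Fair",      "Ref": "Yes"}, "Yes"),
    ({"GPA": "Middle", "ETS": "Low",    "WE": "Excellent", "Ref": "N/A"}, "No"),
    ({"GPA": "Middle", "ETS": "Middle", "WE": "Fair",      "Ref": "Yes"}, "No"),
    ({"GPA": "High",   "ETS": "High",   "WE": "Excellent", "Ref": "Yes"}, "Yes"),
    ({"GPA": "Low",    "ETS": "Low",    "WE": "Fair",      "Ref": "Yes"}, "No"),
    ({"GPA": "Middle", "ETS": "High",   "WE": "Fair",      "Ref": "N/A"}, "Yes"),
    ({"GPA": "Middle", "ETS": "Low",    "WE": "Excellent", "Ref": "Yes"}, "No"),
    ({"GPA": "Low",    "ETS": "Middle", "WE": "Fair",      "Ref": "Yes"}, "No"),
    ({"GPA": "High",   "ETS": "Middle", "WE": "Fair",      "Ref": "Yes"}, "Yes"),
    ({"GPA": "High",   "ETS": "High",   "WE": "Fair",      "Ref": "N/A"}, "Yes"),
]
print(round(gain(data, "GPA"), 3))  # ~0.515 (the paper rounds to 0.514)
print(round(gain(data, "ETS"), 3))  # 0.6, matching STEP 4
```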
STEP 4. In the same way as STEP 3, compute Gain(ETS) = 0.6, Gain(WE) = 0.3389, and Gain(Ref) = 0.05. Since ETS has the highest information gain among the four attributes, ETS is selected as the attribute on which to split the tree.

3.2 Constructing a Decision Tree
We use the selected condition attribute ETS to form the decision tree, obtaining the following equivalence classes: high: {4, 6, 10}, middle: {1, 3, 8, 9}, low: {2, 5, 7}. The subset middle: {1, 3, 8, 9} needs to be split further. Following the algorithm expressed in Section 3.1, the attribute GPA has the highest information gain and is used to split this subset. The whole decision tree is then complete, as shown in Fig. 4.
Fig. 4. Decision tree based on information gain. The root node (samples 1-10) tests ETS: high leads to yes {4, 6, 10}, low leads to no {2, 5, 7}, and middle leads to a GPA test over {1, 3, 8, 9}, where high gives yes {9}, low gives no {8}, and middle leaves {1, 3} unresolved (?).
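A compact recursive sketch of this ID3-style construction, reusing `gain` and the `data` list above; the stopping rules are simplified to match the example:

```python
def build_tree(samples, attrs):
    """Recursively split on the highest-gain attribute, ID3-style."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:
        return labels[0]                     # pure node -> labeled leaf
    if not attrs or max(gain(samples, a) for a in attrs) <= 0:
        return set(labels)                   # unresolved leaf, e.g. samples {1, 3}
    best = max(attrs, key=lambda a: gain(samples, a))
    subsets = {}
    for features, label in samples:
        subsets.setdefault(features[best], []).append((features, label))
    rest = [a for a in attrs if a != best]
    return (best, {value: build_tree(subset, rest)
                   for value, subset in subsets.items()})

tree = build_tree(data, ["GPA", "ETS", "WE", "Ref"])
# -> ('ETS', {'Middle': ('GPA', {...}), 'Low': 'No', 'High': 'Yes'}),
#    mirroring the tree of Fig. 4.
```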
3.3 Extract Classification Rules
Data classification is an important data mining task [2] that tries to identify common characteristics in a set of N objects contained in a database and to categorize them into different groups. We extract classification IF-THEN rules from the equivalence classes above. For the equivalence class {4, 6, 10}, all samples have the identical attribute value ETS = high and class Admission = yes. So we use the condition attribute value (ETS = high) as the rule antecedent and the class label attribute value (Admission = yes) as the rule consequent, obtaining the following classification rule:
IF ETS = "high" THEN Admission = "yes"
Similarly, the other classification rules can be extracted in this manner, giving the following rules:
1. IF ETS = "high" THEN Admission = "yes"
2. IF ETS = "low" THEN Admission = "no"
3. IF ETS = "middle" AND GPA = "high" THEN Admission = "yes"
4. IF ETS = "middle" AND GPA = "low" THEN Admission = "no"
4 Conclusion
This paper is concerned with fuzzy sets and decision trees. We present a fuzzy decision tree model based on fuzzy set theory and information theory. It proposes a fuzzy decision tree induction method for fuzzy data whose numeric attributes can be represented by fuzzy numbers, interval values, or crisp values, whose nominal attributes are represented by crisp nominal values, and whose classes carry confidence factors. An example is used to demonstrate the method's validity. First, we applied fuzzy set theory to transform real-world data into fuzzy linguistic forms. Second, we used information theory to construct a decision tree. Finding the best split point and performing the split are the main tasks in a decision tree induction method. The integration of fuzzy set theory and information theory makes classification tasks originally thought too difficult or complex become feasible, and provides an alternative for evaluating the best possible candidates.
References:
[1] Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003.
[2] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, From Data Mining to Knowledge Discovery, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[3] Michael Negnevitsky, Artificial Intelligence, Addison
Wesley, 2002.
[4] Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[5] H. J. Zimmermann, Fuzzy Set Theory and Its Applications,
Kluwer Academic Publishers, 1991.
[6] J.-S. R. Jang, Self-Learning Fuzzy Controllers Based on Temporal Back-Propagation, IEEE Trans. on Neural Networks, Vol. 3, September 1992, pp. 714-723.
[7] S. Horikawa, T. Furuhashi and Y. Uchikawa, On Fuzzy Modeling Using Fuzzy Neural Networks with the Back-Propagation Algorithm, IEEE Trans. on Neural Networks, Vol. 3, September 1992, pp. 801-806.
[8] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[9] Shu-Tzu Tsai, Chao-Tung Yang, Decision Tree Construction for
Data Mining on Grid Computing, IEEE International Conference on
e-Technology, e-Commerce and e-Service, 2004.
[10] C. Z. Janikow, Fuzzy Decision Trees: Issues and Methods, IEEE Trans. on Systems, Man, and Cybernetics - Part B, Vol. 28, No. 1, February 1998, pp. 1-14.
[11] J. Jang, Structure Determination in Fuzzy Modeling: A Fuzzy CART Approach, Proc. IEEE Conf. on Fuzzy Systems, 1994, pp. 480-485.
[12] R.L.P. Chang and T. Pavlidis, Fuzzy Decision Tree Algorithms, IEEE Trans. on Systems, Man, and Cybernetics, Vol. 7, No. 1, 1977, pp. 28-35.
[13] Keon-Myung Lee, Kyung-Mi Lee, Jee-Hyong Lee and Hyung Lee-Kwang, A Fuzzy Decision Tree Induction Method for Fuzzy Data, IEEE International Fuzzy Systems Conference Proceedings, Vol. 1, August 1999, pp. 16-21.
[14] T.P. Hong, C.H. Chen, Y.L. Wu, Y.C. Lee, Using Divide-and-
Conquer GA Strategy in Fuzzy Data Mining, The Ninth IEEE Symposium
on Computers and Communications, 2004.