
Data Mining Using Cultural Algorithms and Regional Schemata∗

∗ Research supported by a National Science Foundation Grant in Information and Intelligent Systems, NSF IIS-9907257.

Xidong Jin
Innovative Software Systems, Inc.
Ann Arbor, MI 48103

Robert G. Reynolds
Department of Computer Science
Wayne State University
Detroit, MI 48202

Abstract

In this paper we demonstrate how evolutionary search for functional optima can be used as a vehicle for data mining. That is, in the process of searching for optima in a multi-dimensional space we can keep track of the constraints that must be placed on related variables in order to move towards the optima. Thus, a side effect of evolutionary search can be the mining of constraints for related variables. Here we use a Cultural Algorithm framework to embed the search and store the results in regional schemata. An application to a large-scale real-world archaeological data set is presented.

1. Introduction

Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting hidden knowledge from large-scale datasets. KDD can include many sub-phases that often involve significant iteration, such as data cleaning, data selection, and Data Mining (DM). DM is viewed as the essential phase in KDD, in which intelligent methods are used to extract data patterns. A data mining system can perform several data mining tasks, such as classification, association, prediction, clustering, time-series analysis, and outlier analysis [1]. KDD systems that support multiple DM tasks are necessary to mine large-scale databases. Data Mining integrates technologies from multiple disciplines, such as database systems, statistics, machine learning, and data visualization.

In this paper we apply evolutionary techniques based upon Cultural Algorithms to expedite data mining in large-scale temporal-spatial databases. The evolutionary computation search is used to “naturally select” the necessary cases from a large-scale database, avoiding an exhaustive search of every case in that database. Traditional Evolutionary Computation methods have limited or implicit mechanisms for representing and reasoning about knowledge. Since a Cultural Algorithm has a knowledge-based component, it has the potential to perform data mining tasks. Previous papers have shown that regional schemata can be used to represent regional knowledge, including constraint knowledge [2, 3]. Here we use regional schemata to record the evolution of knowledge that is extracted during the evolutionary search for solutions to constrained problems. In this model, therefore, the process of solving a constrained problem can be viewed not only as a search process but also as a potential data mining process, one that can extract previously unknown knowledge about the patterns of the functional landscape during the optimization search. From another point of view, cultural algorithms and regional schemata support two reciprocal and symbiotic processes: an evolutionary optimization process in the population space that solves a general constrained problem, and a data mining process in the belief space that evolves, or mines, the regional knowledge. There are several advantages to doing this. First, evolutionary search helps to “select” individuals in the search space efficiently and naturally, which can make the data mining process more efficient. Second, the model provides a communication protocol between the belief space and the population space, so that the evolutionary search process and the data mining process are integrated seamlessly and can benefit from each other.

Cultural Algorithms have explicit mechanisms to represent knowledge obtained during the search. One of our aims is to make the evolutionary search process and the data mining process integrated and reciprocal: the knowledge mined in the belief space contributes to the evolutionary search in the database, and the evolutionary search in the database facilitates pattern formation in the belief space.

This paper is arranged as follows. Section 2 reviews the framework of Cultural Algorithms from a data-mining point of view. Sections 3 through 6 describe the Cultural Algorithm functions used to implement this framework. Section 3 introduces the belief space representation and some concepts for real-world data mining applications, such as indicators and dimension control. Section 4 describes the acceptance function. Section 5 describes how to update the knowledge in the belief space. Section 6 presents the influence function. In Section 7, the approach is applied to a real-world project.

Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’02) 1082-3409/02 $17.00 © 2002 IEEE

2. A Cultural Algorithm Framework: from a Data-Mining View


A general framework (Fig. 1) is given in which to conduct data mining tasks using Cultural Algorithms. In this framework, the population space consists of all the records in the database (each record is an individual), and the belief space consists of a set of hierarchically structured regional schemata. In the population space, the task of the evolutionary search is to find the “fittest” record in the database according to certain standards (objective functions), so it can be viewed as a process for solving a general constrained problem. In the belief space, the task is to evolve the knowledge acquired from the population space; this is a data-mining process that extracts previously unknown patterns. The evolutionary optimization and the data mining are integrated by a protocol between the belief space and the population space. The population space in this dual-inheritance system provides collective experience to the knowledge evolution in the belief space via the acceptance function. The belief space facilitates the evolutionary search in the population space by providing global knowledge via the influence function. Therefore, the optimization process and the data mining process can each benefit symbiotically from the existence of the other.

In order to mine knowledge from databases using Cultural Algorithms, the obtained knowledge must be represented and reasoned about during the process. The knowledge representation formats can be binary schemata, interval schemata or regional schemata. Databases contain many real-valued fields (parameters), so binary schemata are not suitable as a general tool for showing the dependence among real-valued parameters. Interval schemata have been used in the field of knowledge-based re-engineering [4]. However, interval schemata do not explicitly address the dependence among the parameters. Regional schemata [2, 3] extend the interval schemata, and they can be used to represent complicated logical relationships (both the independent and the interdependent relationships) among the parameters. Therefore, regional schemata are used here as a general tool to represent the knowledge.


3. The Belief Space Representation

The configuration here is based on the Hierarchical Architecture Model, so the belief space is also represented by a hierarchically structured collection of self-organized regional schemata. For a problem with r variables, we can maintain a tree-structured collection of r-dimensional regional schemata. Each of these regional schemata has a set of attributes. We discuss the self-maintained structure first, the attributes next, and then we give some extensions for the real-world applications.

3.1. Hierarchical knowledge structure

The belief space maintains a tree-structured collection of regional schemata. These regional schemata can be represented as $C = (C_1, \ldots, C_i, \ldots, C_l)$, where $C$ is the set of regional schemata and $l$ is the current number of schemata. Each schema, such as $C_i$, represents the regional knowledge of a corresponding search space cell. This tree structure can be self-organized by mechanisms such as the fuse() and split() functions.

3.2. Attributes of the regional schemata

Each schema has a set of attributes. Without loss of generality, a schema can be denoted as $C_i = (Class_i, W_i, Left_i, Csize_i, Parent_i, Children_i, Crecord_i)$.

$Class_i$ represents the regional identification (features) of the corresponding search space cell, e.g. feasible/infeasible, good-/poor-performing, etc. For the real-world application in Section 7, $Class_i$ takes one of 4 values: good, bad, mixed, and average. If a search space cell contains both top-performing individuals and below-average individuals, the cell and the corresponding regional schema are called mixed. Both the top-performing individuals and the below-average individuals are selected by the acceptance function. If a search space cell contains only top-performing individuals, the cell and the corresponding regional schema are good. If a search space cell contains only below-average individuals, the cell and the corresponding regional schema are bad. If a search space cell contains neither top-performing nor below-average individuals, the cell and the corresponding regional schema are average.

Fig.1. A data mining framework using Cultural Algorithms
[Diagram labels: Evolving Population Space; Evolving Belief Space; Schemata: Knowledge indicators, Evolution directors; functions accept(), influence(), update(), generate()*, select()*, obj().]

3.3. Extensions for real-world applications

Some new concepts are now proposed for real-world data mining applications: dimension handling, indicators, directors, and extensive indicators.

• Dimension Handling

The maintenance of the schemata is flexible. For a problem with r variables, we don’t have to limit our choices to the r-dimensional regional schemata only. Since any m-dimensional (m<r) regional schema can be viewed as a special case of the general r-dimensional schemata, we can alternatively choose to maintain a series of selected m-dimensional regional schemata instead. For example, we can manage a series of selected 2-dimensional or 1-dimensional regional schemata. We consider the 2-dimensional schemata to be very important, because they lend themselves to visualization, an important research field in data mining.

• Indicator/Director

The regional schemata that are used to describe the extracted knowledge are referred to as indicators. The schemata that can be used to influence the search in the population space are referred to as directors. Some schemata serve only as indicators, while others are both indicators and directors.

• Extensive indicator

In real-world applications, some data fields are not real-valued. In these cases, statistical representations, such as percentages, can also be recorded in the hierarchical model in order to describe the extracted knowledge. They are called extensive indicators. Some examples will be shown later.
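As a minimal sketch of how an extensive indicator could be computed, the Python snippet below records the share of each category among the individuals of one generation. The function name `extensive_indicator` and the sample category labels are illustrative assumptions, not part of the original system.

```python
from collections import Counter


def extensive_indicator(values):
    """Return the fraction of each category among the given values.

    `values` holds one non-real-valued field (for example a topography
    label) for every individual in a generation.
    """
    counts = Counter(values)
    total = len(values)
    # An empty generation yields an empty dict, so no division by zero occurs.
    return {category: count / total for category, count in counts.items()}


# Hypothetical labels for the individuals of one generation.
generation = ["near flat", "slope", "slope", "near flat", "slope"]
print(extensive_indicator(generation))
```

Recording one such dictionary per generation gives a series of percentages that can be tracked over the course of the search.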

4. The Acceptance Function

The acceptance function controls the information flow from the population space to the belief space. accept() selects the individuals that can directly impact the formation of the knowledge saved in the belief space. For the application here, the acceptance function selects the good-individuals (the top $\alpha_g$% of individuals) to determine the good cells, and it selects the bad-individuals (the worst-performing $\alpha_b$% of individuals) to classify the bad cells. $\alpha_g$ and $\alpha_b$ are the selection rates for good and bad individuals, respectively.
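The selection described above might be sketched as follows. This is only an illustration under stated assumptions: the selection rates are treated as fractions between 0 and 1, a larger objective value is taken to be better, and the names `accept`, `alpha_g` and `alpha_b` are chosen here for readability.

```python
def accept(population, fitness, alpha_g=0.3, alpha_b=0.3):
    """Return (good, bad): the best and the worst individuals by fitness.

    alpha_g and alpha_b are assumed to be fractions of the population.
    """
    ranked = sorted(population, key=fitness, reverse=True)  # best first
    n_good = int(len(ranked) * alpha_g)
    n_bad = int(len(ranked) * alpha_b)
    good = ranked[:n_good]
    bad = ranked[len(ranked) - n_bad:]  # empty when n_bad is 0
    return good, bad


# Toy usage: individuals are numbers and the fitness is the value itself.
good, bad = accept([3, 1, 4, 2], fitness=lambda v: v, alpha_g=0.5, alpha_b=0.5)
print(good, bad)
```

The two returned lists are what the belief space would use to label cells as good, bad, mixed or average.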

5. Knowledge Update

5.1. Knowledge structure update

The tree-structured knowledge saved in the belief space can be self-adjusted by the split() and fuse() functions.

Split(): This function determines how to decompose a cell. Many methods can be used; here we use the quad-tree decomposition, where a cell is split into two equal parts along each of the r dimensions.

When_to_Split(): This function provides rules that determine when to split a cell. For the real-world archaeological data mining application in Section 7, we consider the following conditions important in identifying a cell for splitting.

• A cell contains both good-individuals and bad-individuals, i.e. a mixed cell.

• There are too many individuals contained within this cell.

Fuse(): This function determines how to fuse cells together. For the application in section 7, we fuse sub-cells into a larger cell.

When_to_Fuse(): This function is used to determine when to fuse. Fusion can take place under many circumstances, for example when all of the sub-cells have the same constraint features and their fitness values do not vary much.
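The decomposition performed by Split() can be illustrated with a short sketch. It assumes, purely for illustration, that a cell is stored as a pair of lists (`left`, `csize`) giving its lower corner and its size in each dimension; halving every dimension then yields 2**r sub-cells.

```python
from itertools import product


def split(left, csize):
    """Split a cell into 2**r equal sub-cells by halving every dimension.

    Returns a list of (left, csize) pairs, one per sub-cell.
    """
    half = [s / 2 for s in csize]
    children = []
    # Each offset tuple picks the lower (0) or upper (1) half per dimension.
    for offsets in product((0, 1), repeat=len(left)):
        child_left = [l + o * h for l, o, h in zip(left, offsets, half)]
        children.append((child_left, list(half)))
    return children


# A 2-dimensional cell of size 4 x 2 is split into four 2 x 1 sub-cells.
for child in split([0.0, 0.0], [4.0, 2.0]):
    print(child)
```

A matching fuse step would simply replace the children by their parent cell when the fusion conditions hold.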

5.2. Regional schemata update

Regional schemata can be updated when new information comes from the population space via the acceptance function.

The attribute $Class_i$ takes one of 4 values here: good, bad, mixed, and average. The update process is based on the following.

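$$
Class_i =
\begin{cases}
\text{good} & \text{if } \exists(\text{good\_individuals}) \text{ and } \neg\exists(\text{bad\_individuals}) \\
\text{bad} & \text{if } \neg\exists(\text{good\_individuals}) \text{ and } \exists(\text{bad\_individuals}) \\
\text{mixed} & \text{if } \exists(\text{good\_individuals}) \text{ and } \exists(\text{bad\_individuals}) \\
\text{average} & \text{otherwise}
\end{cases}
$$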

Here $Class_i$ indicates the regional knowledge in the corresponding search space cell. good-individuals and bad-individuals refer to the top-performing individuals and the below-average individuals respectively, both of which are selected by the acceptance function. $W_i$, which indicates the degree of importance of the corresponding cell, can be updated directly according to the value of $Class_i$. $Csize_i$ represents the size of that search space cell, and it can be explicitly adjusted by the split()/fuse() functions as previously described in Section 5.1.
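The class update can be sketched in a few lines. The cell representation (`left`, `csize`), the half-open membership test and the function names are assumptions made for this illustration; the four class labels follow the text.

```python
def contains(left, csize, point):
    """True if the point lies inside the cell (half-open in every dimension)."""
    return all(l <= x < l + s for l, x, s in zip(left, point, csize))


def classify_cell(left, csize, good, bad):
    """Label a cell from the accepted good- and bad-individuals it contains."""
    has_good = any(contains(left, csize, p) for p in good)
    has_bad = any(contains(left, csize, p) for p in bad)
    if has_good and has_bad:
        return "mixed"
    if has_good:
        return "good"
    if has_bad:
        return "bad"
    return "average"


# A 10 x 10 cell holding one good individual and no bad individual.
print(classify_cell([0, 0], [10, 10], good=[(1, 1)], bad=[(20, 20)]))
```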

6. The Influence Function

The influence function controls the information flow from the belief space to the population space. This information (regional knowledge) can be applied to guide the evolutionary search in the population space. We use the strategy of moving the generation of individuals from unpromising search space cells into more promising ones. We employ the following basic rules:

• The individuals in promising cells, such as the good or mixed cells, create their offspring nearby by perturbing their parameters a little.

• The individuals in the average or bad cells create their offspring by strategically and probabilistically changing their values in order to move to the good or mixed cells.

The process can be specified as the following three cases.

6.1. If a parent is inside a good or mixed cell

When $x_i$, a parent in a population of size $n$ with $r$ parameters, is in a good or mixed cell, it perturbs its parameters a little in order to create the offspring, $x_{n+i}$:

$$x_{n+i} = x_i + \alpha \ast Csize \bullet N(0,1), \quad \forall i \in \{1, 2, \ldots, n\},$$

where $Csize = (Csize_1 \; \ldots \; Csize_r)$ is a 1*r vector denoting the sizes in all $r$ dimensions of the cell in which the parent $x_i$ is contained. $N(0,1) = (N_1(0,1) \; \ldots \; N_r(0,1))$ denotes a 1*r vector whose elements are samples from a zero-mean, unit-variance normal distribution. $\alpha$ is a selected positive number.
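The perturbation rule for parents in good or mixed cells can be sketched as below. The function name `perturb` and the optional `rng` argument are assumptions for illustration; each parameter receives Gaussian noise scaled by the cell size in that dimension and by α.

```python
import random


def perturb(parent, csize, alpha, rng=random):
    """Create an offspring near the parent.

    Each parameter is shifted by alpha * (cell size in that dimension)
    * (a standard normal sample).
    """
    return [x + alpha * s * rng.gauss(0.0, 1.0) for x, s in zip(parent, csize)]


# A 2-parameter parent inside a cell of size 100 x 50.
rng = random.Random(0)
print(perturb([350.0, 420.0], [100.0, 50.0], alpha=0.3, rng=rng))
```

Because the noise scales with the cell size, offspring stay closer to the parent as cells are split into smaller ones.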

6.2. If a parent is in an average cell

If a parent ($x_i$) is in an average cell, the offspring ($x_{n+i}$) has a higher probability of appearing in a good or mixed cell by using regional knowledge saved in the belief space. It is achieved as follows.

• Probabilistically select a search space cell, represented by the regional schema $C_b$, from the set of all current regional schemata (the leaves of the regional schemata tree). The idea here is that a good or mixed cell has a higher probability of being chosen as a target. This process can be achieved by roulette-wheel selection.

• Use $C_b$ as a beacon to attract the parent to move a copy toward the selected target cell in order to create an offspring.

$$
x_{n+i,j} =
\begin{cases}
x_{i,j} + \beta \ast Csize_{b,j} \ast N_{i,j}(0,1) & \text{if } x_{i,j} < Left_{b,j} \\
x_{i,j} - \beta \ast Csize_{b,j} \ast N_{i,j}(0,1) & \text{if } x_{i,j} > Right_{b,j} \\
x_{i,j} + \beta \ast Csize_{b,j} \ast N_{i,j}(0,1) & \text{otherwise}
\end{cases}
\quad \forall i \in \{1, 2, \ldots, n\} \text{ and } \forall j \in \{1, 2, \ldots, m\},
$$

where $x_{n+i,j}$ is the jth element of $x_{n+i}$, which corresponds to the jth optimization parameter. $Left_{b,j}$ (also denoted $C_b.Left_j$) is the lower bound of the cell $C_b$ for the optimization parameter $j$. $Right_{b,j}$ is the upper bound of cell $C_b$ for parameter $j$. $Csize_{b,j}$ is the size of $C_b$ for parameter $j$, and we have $Right_{b,j} = Left_{b,j} + Csize_{b,j}$. $\beta$ is a certain positive number.
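The two steps for parents in average cells might look like the following sketch. Several details are assumptions that the text does not specify: the numeric selection weights in `CLASS_WEIGHT` are hypothetical, cells are represented as dictionaries with the keys `cls`, `left` and `csize`, and the step magnitude is taken as an absolute value outside the target cell so that the move is directed toward it.

```python
import random

# Hypothetical roulette-wheel weights; good and mixed cells are favoured.
CLASS_WEIGHT = {"good": 4.0, "mixed": 3.0, "average": 1.0, "bad": 0.5}


def select_target(cells, rng=random):
    """Roulette-wheel selection of a target cell among the leaf cells."""
    weights = [CLASS_WEIGHT[c["cls"]] for c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


def attract(parent, target, beta, rng=random):
    """Move a copy of the parent toward the target cell, one parameter at a time."""
    child = []
    for x, left, size in zip(parent, target["left"], target["csize"]):
        step = beta * size * rng.gauss(0.0, 1.0)
        right = left + size
        if x < left:
            child.append(x + abs(step))   # assumption: step toward the cell
        elif x > right:
            child.append(x - abs(step))   # assumption: step toward the cell
        else:
            child.append(x + step)        # already inside: plain perturbation
    return child


rng = random.Random(1)
cells = [
    {"cls": "good", "left": [0.0], "csize": [1.0]},
    {"cls": "bad", "left": [5.0], "csize": [1.0]},
]
target = select_target(cells, rng)
print(target["cls"], attract([-5.0], cells[0], beta=0.3, rng=rng))
```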

6.3. If a parent is in a bad cell

If a parent ($x_i$) is in a bad cell, then the offspring ($x_{n+i}$) will be created in a good or mixed cell by using the regional knowledge saved in the belief space. It is achieved in the following two steps.

• Select a target cell, represented here by $C_b$, probabilistically from the set of the {good ∪ mixed} cells. This process can be implemented by a uniform-distribution selection.

• The parent ($x_i$) creates the offspring in the target cell represented by $C_b$:

$$x_{n+i} = Left_b + Uniform(0,1) \bullet Csize_b,$$

where $Left_b$ is a 1*r array which denotes the left-most position of cell $C_b$; $Csize_b$ is a 1*r array that represents the sizes of $C_b$ in $r$ dimensions; and $Uniform(0,1)$ generates a 1*r array whose elements are uniformly distributed random numbers within [0,1]. “$\bullet$” means element-by-element array multiplication.

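For parents in bad cells, the two steps can be sketched as follows. The dictionary-based cell representation and the name `create_in_target` are assumptions for illustration; the sketch also assumes that at least one good or mixed cell exists, since `random.choice` raises an error on an empty list.

```python
import random


def create_in_target(cells, rng=random):
    """Create an offspring uniformly at random inside a good or mixed cell."""
    targets = [c for c in cells if c["cls"] in ("good", "mixed")]
    target = rng.choice(targets)  # uniform selection among the candidates
    # Element-by-element: left-most position plus a uniform share of the size.
    return [l + rng.random() * s for l, s in zip(target["left"], target["csize"])]


rng = random.Random(0)
cells = [
    {"cls": "bad", "left": [0.0, 0.0], "csize": [1.0, 1.0]},
    {"cls": "good", "left": [10.0, 20.0], "csize": [2.0, 4.0]},
]
print(create_in_target(cells, rng))
```

Unlike the rule for average cells, the parent's own position plays no role here: the offspring is placed directly inside the target cell.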
7. Mining the Valley of Oaxaca Archaeological Dataset

The data used here come from a large-scale spatial-temporal archaeological database with hundreds of variables. These archaeological data were collected from the Valley of Oaxaca in Mexico. This area is among the very few places on earth where large amounts of data have been systematically collected to analyze cultural evolution. The ultimate goal of this research is to extract hidden temporal and spatial patterns that reflect the important cultural and economic changes in the Valley of Oaxaca between 9000 B.C. and the years prior to the Spanish conquest in 1500 A.D. Our preliminary question concerns the constraints that were placed on where people settled, and how these constraints changed over time as a reflection of Monte Alban’s status in the valley. To illustrate, we examine a specific site in the region, Monte Alban. This site evolved to become a large urban center and the capital of the Zapotec State. The site is broken up into several thousand residential terraces, with each terrace described in terms of several hundred variables. These variables describe the terrace's environment, the presence of architectural structures, and other information about artifacts such as pottery and stone tools. The database contains a large amount of information about the contents of occupied terraces, including different variables of the terraces and the number of ceramic sherds (pieces) found within these terraces.

The history of these terraces in Monte Alban is broken into a number of periods, Period I through Period V. Some periods, such as Period I, can be further subdivided, for instance into Period Early I and Period Late I. Table 1 indicates the times for these periods.

Table 1. History of Monte Alban

Period            Dates
Period V          A.D. 950 to A.D. 1500
Period IIIb-IV    A.D. 450 to A.D. 950
Period IIIa       A.D. 200 to A.D. 450
Period II         200 B.C. to A.D. 200
Period Late I     300 B.C. to 200 B.C.
Period Early I    500 B.C. to 300 B.C.

7.1. The implementation and the example problem

For each of the periods from Period Early I to Period V, we apply our approach to mine the knowledge in this database. Let us specify the search in the population space and the knowledge evolution process in the belief space in turn. In the population space, the search process is to find the terraces with certain maximal (or minimal) values, such as those terraces with the largest number of diagnostic ceramic sherds corresponding to a certain period. In the belief space, a series of schemata is maintained to indicate the extracted knowledge. As the population searches for terraces with the largest number of diagnostic ceramic sherds, information about the terraces that are associated with higher (and lower) values of these diagnostic ceramics for a given time period can be extracted and reasoned about in the belief space. This allows the belief space to accumulate knowledge about properties that are held in common by the terraces occupied during a given period.

The regional schemata support the representation of, and reasoning about, multi-dimensional correlated variables. However, for the purpose of visualization, we can keep most of the variables in the belief space as one- and two-dimensional structures. In a real application, because of the logical relationships between the variables, many variables must be considered jointly, for example the “north grid coordinates” and the “east grid coordinates”. These two coordinate variables must be treated as interdependent, since each of them alone tells us little and only the combination of the two can tell us which specific sub-regions have large amounts of ceramics. For these kinds of variables, we must use the 2- or multi-dimensional regional schemata representation. For those variables that are not obviously correlated, we can first assume the variables to be independent, so that the regional schemata in the belief space can be 1-dimensional. Since the KDD process often involves significant iteration, it is possible to adjust the dimensions of the schemata if the currently assumed 1-dimensional schemata prove unsuitable. Thus, these 1-dimensional schemata can be changed into multi-dimensional schemata in the future if necessary. For the non-real-valued variables, we use extensive indicators to record the evolution of the corresponding knowledge in the belief space. For instance, for the variable “topography”, we can use the percentage of each case (near flat, hilltop…) in a generation to record and indicate the knowledge evolution. Here we give the example problem.

The Description of an Example Problem

Z = f(x, y), where
• x: East grid coordinates
• y: North grid coordinates
• Z: the number of diagnostic ceramic sherds

Here, we focus on the ceramic information for the following reasons: 1) the ceramics are comparatively well preserved in this area and can be collected in the large quantities necessary for analysis. 2) Ceramics contain chronological information and can be used to date the occupation of a terrace. 3) Other artifacts, such as lithic artifacts, are more difficult to date. Thus, for any given time, the size and location of ceramic production loci are most likely functions of the nature of the economic and political organization of the valley. The number of ceramics can also be used to estimate other facts, such as the population density of an area.

The experimental parameters are given as follows. Population size = 50. The acceptance function parameters: $\alpha_g = 0.3$, $\alpha_b = 0.3$. The influence function parameters: $\alpha = 0.3$, $\beta = 0.3$. The experiments were run on a Dell workstation with two CPUs.


[Plot for Fig. 2: Fitness vs. Generation (Period Late I). x-axis: Generation (0 to 22); y-axis: Fitness (0 to 80).]

Fig.2 indicates the fitness improvement of the population during the search in an example run for Period Late I. The interval in each generation indicates the composition of the population in terms of the objective values (amount of ceramics in terraces). For instance, in generation 1, the objective (fitness) values of the population range from 0 to 24. The line labelled “O-O” describes the changes in the best fitness value from generation to generation. The line labelled “ --- ” shows how the average fitness value changes from generation to generation. It indicates that both the best fitness values and the average fitness values improved during the search. Our approach finds the terrace with the most diagnostic ceramics for a given period (here 86 sherds) quickly, within 12 generations. However, we are more interested here in how other variables are constrained by the search process.

7.2. Mining knowledge about regions with most ceramic sherds

Here we give examples of the pattern extractions from the evolution of the multi-dimensional regional schemata. As we mentioned earlier, the “north grid coordinates” and the “east grid coordinates” must be considered interdependently, thus a collection of 2-dimensional schemata is maintained and evolved in order to extract unknown patterns. The background of Fig.3 shows a rough map of Monte Alban, and the gray rectangles stand for a set of good schemata as defined before. The evolution of these gray rectangles (see Fig.3(a) to Fig.3(d)) indicates the search process for more and more promising and specific regions with abundant ceramics. Fig.3(c) indicates that the regions with the most ceramic sherds are disjoint. Fig.3 also gives us a dynamic picture of where people were more likely to settle within Monte Alban for a given period. Note that these schemata are self-organized during the search, and these regional schemata are not only indicators but also directors, which are used to guide the evolutionary search in the population space.

Fig. 2. The fitness improvement during the search (Period Late I)

[Panels for Fig. 3: maps with x-axis east and y-axis north (0 to 700). Fig.3(a): generation # 1; Fig.3(b): generation # 4; Fig.3(c): generation # 7; Fig.3(d): generation # 8. One panel is titled “Contour Plot”, with axes x and y.]

Fig.3. Knowledge evolution about the regions with abundant sherds

7.3. Mining elevation patterns

While the process of finding terraces with the largest number of diagnostic ceramics is a basic search process, the important contribution is the generation of regional knowledge describing common characteristics of the terraces identified. In other words, the optimization task is a way to systematically drive the concept formation activities here. For example, Fig. 4 shows how the range of elevations for terraces that are likely to have more diagnostic ceramics for a given period is adjusted during the search. Here, the 1-dimensional regional schemata for the variable “Elevation” are used as indicators. This figure indicates that while ceramics can be found across a wide range of elevations (here from 0m to 400m), the terraces with the most diagnostic ceramics, shown by the “O”s, are found at elevations around 375 meters above the valley floor in Period Early I. This indicates that these terraces are on or near the top of the hills at Monte Alban. Assuming that the number of ceramics indicates the likelihood that the terraces were occupied in a given period, this pattern suggests that for a long time the “favorite” living places of the people were between 375m and 400m above the valley floor, on or near hilltops. This settlement pattern is very similar to those in many modern cities where people also tend to live on the hilltops. We also found this pattern in other periods. For example, from Period I to Period IV the elevation of the terraces with the most diagnostic ceramics stays within 375m-400m. Only in Period V does the “favorite” elevation drop to around 300 meters above the valley floor. This implies that during that period higher elevations were less important. This significant difference may lead us to identify the special concerns of the people in Period V, before and during the collapse of the state. That is, what made them think differently from their ancestors?

[Plot for Fig. 4: Elevation vs. Generation (Period Early I). x-axis: Generation (0 to 22); y-axis: Elevation (150 to 400).]

7.4. Mining topography feature patterns

Fig.5 shows an example of the extensive indicator. The variable “Topography” is not real-valued, so here we use a statistical indicator, the percentage, to record the evolution of the knowledge about the topography features. Fig.5(a) indicates that in Period Late I the topographic composition of the terraces changed as the search proceeded toward the cell with the largest amount of ceramics. The y-coordinate indicates the percentage of the terraces with certain features (“near flat”, “slope”, etc.) in a generation. In the initial population, a large proportion of terraces (90%) fell in the category “moderate or steep slope”, but as terraces with more ceramics were found in the database, the percentages of the features changed. The percentage of “moderate or steep slope” decreased, and the category “flat”, including “near flat” and “flat ridge top”, increased.

The terrace with the most ceramics was found on a “near flat” terrace. Fig.5(b) shows the relationship between the percentage of features and the average number of diagnostic ceramics found in each generation. As the average number of diagnostic ceramics increases during the optimization search, the frequency of “near flat” increases and that of “moderate or steep slope” decreases. This pattern suggests that the people in this period (Period Late I) liked to live on “flat” terraces. However, in Period V, the pattern is not the same as those of other periods. The terraces with the most ceramics in Period V were found on “moderate or steep slope”. This significant difference may lead us to identify the special concerns of the people in that period. What kind of special concerns may have changed the behavior of people in Period V?

Fig. 4. Mining elevation patterns during the search (Period Early I)

[Plots for Fig. 5: share of terraces by topography category (near flat, moderate to steep slope, flat ridgetop). y-axis: Topography (0 to 1). x-axis: generation (1 to 19) in panel (a); ave. # of ceramics (0 to 100) in panel (b).]

Fig. 5(a) Topography features vs. generation
Fig. 5(b) Topography features vs. average number of ceramics
Fig. 5. Mining topography feature patterns (Period Late I)

7.5. Mining terrace function patterns

[Plot for Fig. 6: share of terraces by terrace function (unknown, residential, elite). y-axis: Terrace Function (0 to 1); x-axis: Ave. # of Ceramics (0 to 100).]

Fig. 6. Mining terrace function patterns (Period Late I)


Another interesting variable relates to whether the terrace occupied was an “elite” or “common” residence. As shown in Fig.6, when terraces with more ceramics are found during the search process, the percentage of elite-living terraces significantly increases, while that of normal-resident-living terraces decreases. In other words, terraces with more ceramics are more likely to be classified as “elite-living”. Does that mean the elite were more likely to use more pottery than the non-elite?

7.6. Mining “distance to roads” patterns

Fig.7 shows another pattern in Period Early I, where more ceramics are found “directly adjacent to the major ancient road”. This figure indicates the relationship between the percentage of the “distance” features and the average number of diagnostic ceramics found in each generation. As the average number of diagnostic ceramics increases during the optimization search, the frequency of the terraces that are “directly adjacent to the major ancient roads” increases and that of the terraces “far to the major ancient road” decreases. This pattern suggests that terraces with more ceramics in Period Early I are closer to the road. Furthermore, it may imply that the people in Period Early I preferred to live on the terraces that are “directly adjacent to the major roads”. By searching each period, our data mining approach provides us with more interesting patterns of the “popular” living places. In Period Early I, ceramics are found more often “directly adjacent to the major ancient road”. This may indicate the importance of the road for the people when the state in the Oaxaca valley emerged. Then, in the following periods, some terraces with the most ceramics, often belonging to the “elite”, can be found “far to the road”. This may imply that in those periods, access to roads was not as important to the elite as some other factors, say, “privacy”.

The above are interesting patterns we have found by using our approach. In the pattern extraction process, unlike traditional methods, only a small number of records (terraces) in the dataset are accessed. This is because our approach can skip a number of cases in the dataset and still find something interesting. For all periods, fewer than 20% of the records in the dataset are accessed, rather than 100%. This suggests a great potential to reach the goal of efficiency and effectiveness for data mining.

8. Conclusions

This paper demonstrates how to extract knowledge from a large-scale archaeological dataset using Cultural Algorithms. The results show that the new approach successfully extracts interesting patterns that we did not know about before. The new techniques do not have to access all the information in the database in order to mine interesting patterns: fewer than 20% of the records in the database are accessed for data mining purposes, rather than 100%. The new techniques can “naturally select” records from the database instead of performing an exhaustive search. In this framework, the data-mining process that occurs in the belief space and the optimization process that occurs in the population space are integrated and reciprocal. The experiments suggest a great potential to reach the goal of efficiency and effectiveness for data mining by using Cultural Algorithms and Regional Schemata.

[Plot for Fig. 7: share of terraces by “distance to roads” category (far, close, directly adjacent). y-axis: “distance to roads” (0 to 1); x-axis: generation (1 to 19).]

Fig. 7. Mining “distance to road” patterns (Period Early I)

References

[1] M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An Overview from a Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996.

[2] Xidong Jin and R.G. Reynolds, “Using Knowledge-Based Evolutionary Computation to Solve Nonlinear Constraint Optimization Problems: a Cultural Algorithm Approach,” in Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), pp. 1672-1678, Washington, DC, 1999.

[3] Xidong Jin and R.G. Reynolds, “Using Knowledge-Based System with Hierarchical Architecture to Guide the Search of Evolutionary Computation,” in Proceedings of 11th IEEE International Conference on Tools with Artificial Intelligence, pp. 29-36, Chicago, IL, Nov. 1999.

[4] M. Sternberg and R.G. Reynolds, “Using Cultural Algorithms to Support Re-Engineering of Rule-Based Expert Systems in Dynamic Performance Environment: A Case Study in Fraud Detection,” IEEE Transactions on Evolutionary Computation, Vol.1, No.4, pp. 225-243, 1997.

