NEW DATA MINING MODEL BY SYMBOLIC DATA ANALYSIS
E. Diday, CEREMADE, Paris-Dauphine University, Paris (France)
ABSTRACT The usual data mining model is based on two parts: the
first refers to the units ("individuals"), and the second contains
their description by several standard variables, which may be
numerical or categorical. The symbolic data analysis model needs
two further parts: the first refers to units called "concepts",
which represent classes or categories of individuals, and the
second concerns the "description" of this new kind of unit.
Concepts are described by "symbolic data", which are the usual
categorical or numerical data as well as intervals, histograms,
distributions, sequences of values, etc. These new kinds of data
make it possible to preserve the internal variation of the extent
of each concept. The SYR software (2009) was developed after the
academic SODAS software, which resulted from two European projects
running until 2003. Its aim is to extract, from a data file (.txt,
.csv, Access database) with several million units, a reduced number
of "concepts" that summarize the original data and are described by
variables whose values are symbolic data. An example of a symbolic
data table is shown in Figure 1. New data mining tools extended to
concepts as new units (visual description, principal component
analysis, clustering, decision trees, rule extraction, regression,
etc.) can then be used to extract new knowledge from this model. It
can be shown that symbolic concept descriptions are structured by
"stochastic Galois lattices". The underlying model of symbolic
variables is that of variables whose values are random variables
rather than, as usual, numbers, so "copulas" are needed. Much work
is needed on the validation, stability, and robustness of the
results. Recent results on principal component analysis and
standard mixture decomposition are presented. Applications cover
all fields where new knowledge and higher-level models about
concepts have to be obtained from small or large databases.
References
L. Billard, E. Diday (2003). "From the Statistics of Data to the
Statistics of Knowledge: Symbolic Data Analysis". Journal of the
American Statistical Association (JASA), June, Vol. 98, No. 462.
L. Billard, E. Diday (2006). "Symbolic Data Analysis: Conceptual
Statistics and Data Mining". Wiley, 330 pages. ISBN 0-470-09016-2.
E. Diday, M. Noirhomme (authors and editors) (2008). "Symbolic Data
Analysis and the SODAS Software". Wiley, 457 pages. ISBN
978-0-470-01883-5.
E. Diday (2005). "Categorization in Symbolic Data Analysis". In
Handbook of Categorization in Cognitive Science, edited by H. Cohen
and C. Lefebvre. Elsevier. http://books.elsevier.com/elsevier/
isbn=0080446124
E. Diday, M. Vrac (2004). "Mixture Decomposition of Distributions
by Copulas in the Symbolic Data Analysis Framework". Journal of
Discrete Applied Mathematics (DAM), Volume 147, Issue 1, 1 April,
Pages 27-41.
Hans-Hermann Bock, Edwin Diday (2000). "Analysis of Symbolic Data
for extracting statistical information from complex data". Springer
Verlag, Heidelberg, 425 pages. ISBN 3-540-66619-2.
E. Diday, R. Emilion (2003). "Maximal and stochastic Galois
Lattices". Journal of Discrete Applied Mathematics, 127, 271-284.
Figure 1 The symbolic data table from the SYR software
Key words: data mining model, symbolic data analysis, SYR software
NEW DATA MINING MODEL BY SYMBOLIC DATA ANALYSIS.
ABSTRACT The usual data mining model has two parts: the first
concerns the units (here called “individuals”), and the second
contains their description by several standard variables, numerical
or categorical. The symbolic data analysis model needs two more
parts: the first concerns units called “concepts”, which represent
classes or categories of individuals, and the second concerns the
“description” of this new kind of unit. Concepts are described by
“symbolic data”: standard categorical or numerical data and, in
addition, intervals, histograms, distributions, sequences of
values, etc. These new kinds of data preserve the internal
variation of the extent of each concept. The SYR software (2009)
was developed after the academic SODAS software, which resulted
from two European projects that ran until 2003. Its aim is to
extract, from a data file (.txt, .csv, ACCESS database) of several
million units, a reduced number of “concepts” that summarize the
initial data and are described by variables whose values are
symbolic data.
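The aggregation step described above, from individuals to concepts, can be illustrated with a minimal sketch. It is not the SYR or SODAS implementation: the records, the variable names (`region`, `age`, `job`), and the helper `build_symbolic_table` are hypothetical and chosen only for illustration. Each numerical variable is summarized as an interval and each categorical variable as a relative-frequency histogram.

```python
from collections import Counter, defaultdict

# Hypothetical individual-level records (illustrative data only).
individuals = [
    {"region": "North", "age": 23, "job": "clerk"},
    {"region": "North", "age": 41, "job": "engineer"},
    {"region": "North", "age": 35, "job": "clerk"},
    {"region": "South", "age": 52, "job": "farmer"},
    {"region": "South", "age": 30, "job": "farmer"},
]


def build_symbolic_table(rows, concept_key, numeric_vars, categorical_vars):
    """Aggregate individuals into concepts described by symbolic data.

    Numerical variables become (min, max) intervals; categorical
    variables become relative-frequency histograms.
    """
    # Group the individuals by the category that defines each concept.
    groups = defaultdict(list)
    for row in rows:
        groups[row[concept_key]].append(row)

    table = {}
    for concept, members in groups.items():
        description = {"n": len(members)}  # size of the concept's extent
        for var in numeric_vars:
            values = [m[var] for m in members]
            description[var] = (min(values), max(values))
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            description[var] = {
                category: count / len(members)
                for category, count in counts.items()
            }
        table[concept] = description
    return table


symbolic_table = build_symbolic_table(individuals, "region", ["age"], ["job"])
for concept, description in symbolic_table.items():
    print(concept, description)
```

Each row of the resulting table is one concept, and its cells keep the internal variation of the individuals it summarizes rather than a single value per cell.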
An example of a symbolic data table is given in Figure 1. New
knowledge can then be extracted from this model with new data
mining tools extended to concepts considered as new units (visual
description, principal component analysis, clustering, decision
trees, rule extraction, regression, etc.). It can be shown that
symbolic concept descriptions are structured by “stochastic Galois
lattices”. The underlying model of a symbolic variable is a
variable whose values are random variables rather than numbers, as
is usual; therefore “copulas” are needed. Much work remains on the
validation, stability, and robustness of the results. Recent
results extending PCA and standard mixture decomposition are
presented. Applications cover all domains where new knowledge and
higher-level models on concepts have to be extracted from small or
large databases.
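As background for the remark on copulas: when the values of a variable are themselves random variables, the dependence between them has to be modeled jointly, and a copula is the standard device for doing so. The statement below is Sklar's theorem in generic notation. The symbols H, F_j, and C are assumptions of this sketch and do not come from the abstract.

```latex
H(x_1, \dots, x_p) = C\bigl(F_1(x_1), \dots, F_p(x_p)\bigr),
\qquad C : [0,1]^p \to [0,1]
```

Here H is a joint distribution function with marginal distribution functions F_1, ..., F_p, and C is a copula; C is unique when the margins are continuous.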
Figure 1 The symbolic data table provided by SYR
Key words: data mining model, symbolic data analysis, SYR software