Computational intelligence tools for data understanding
Włodzisław Duch
Department of Informatics, Nicholas Copernicus University, Torun, Poland
& School of Computer Engineering, Nanyang Technological University, Singapore
Plan
What is this tutorial about?
• How to discover knowledge in data;
• how to create comprehensible models of data;
• how to evaluate new data;
• how to understand what CI methods do.

1. AI, CI & Data Mining
2. Forms of useful knowledge
3. Integration of different methods in GhostMiner
4. Exploration & Visualization
5. Rule-based data analysis
6. Neurofuzzy models
7. Neural models, understanding what they do
8. Similarity-based models, prototype rules
9. Case studies
10. From data to expert system
AI, CI & DM
Artificial Intelligence: symbolic models of knowledge.
• Higher-level cognition: reasoning, problem solving, ...
• There is no free lunch – different types of tools are provided for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based, SVM, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery: model building and evaluation, organized into projects.
GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building and knowledge discovery (hackers) from model use (lamers) =>
GhostMiner Developer & GhostMiner Analyzer
Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars.
Task: recognize the source of a wine sample.
13 quantities measured, all features continuous:
• malic acid content
• alkalinity of ash
• total phenols content
• nonanthocyanin phenols content
• color intensity
• hue
• proline.
Exploration and visualization
General info about the data
Exploration: data
Inspect the data
Exploration: data statistics
Distribution of feature values
Proline has very large values, most methods will benefit from data standardization before further processing.
Exploration: data standardized
Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean − std, mean + std].
Other options: normalize to [-1,+1], or normalize rejecting p% of extreme values.
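The two preprocessing options described above can be sketched in a few lines of stdlib-only Python; this is a minimal illustration, not GhostMiner code, and the function names and sample values are mine:

```python
import math

def standardize(values):
    """Z-score: shift to zero mean, scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def normalize(values, lo=-1.0, hi=1.0):
    """Linear rescaling of the feature range to [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

# Proline-like feature with large raw values (illustrative numbers)
proline = [1065.0, 1050.0, 1185.0, 735.0, 1480.0]
z = standardize(proline)
```

After standardization all features contribute on a comparable scale, which is what distance-based and gradient-based methods need.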
Decision trees
Simplest things first: use a decision tree to find logical rules.
Test a single attribute, find a good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program; many good algorithms exist.
Decision borders
Univariate trees: test the value of a single attribute, x < a.
Multivariate trees: test combinations of attributes.
Result: feature space is divided into large hyperrectangular areas with decision borders perpendicular to axes.
Splitting criteria
Most popular: information gain, used in C4.5 and other trees.
CART trees use Gini index of node purity:
Which attribute is better?
Which should be at the top of the tree?
Look at entropy reduction, or the information gain index:

E(S) = − P_t log₂ P_t − P_f log₂ P_f

G(S, A) = E(S) − (|S₁|/|S|) E(S₁) − (|S₂|/|S|) E(S₂)

Gini index of node purity:

G_Gini(node) = 1 − Σ_{i=1..C} P_i²
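These impurity measures and the resulting gain are short enough to sketch directly; a stdlib-only illustration (function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_c P_c * log2(P_c) over class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a node: 1 - sum_c P_c^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, left, right):
    """G(S, A) = E(S) - |S1|/|S| * E(S1) - |S2|/|S| * E(S2)."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A perfectly separating split recovers the full entropy of the parent node as gain; a split that leaves the class proportions unchanged gains nothing.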
Non-Bayesian selection
Bayesian MAP selection: choose the class with maximum a posteriori probability p(C|X).
Problem: for binary features non-optimal decisions are taken!
MC(A1) < MC(A2), AUC(A1) < AUC(A2), but MI(A1) > MI(A2)!
SSV decision tree
Separability Split Value tree, based on the separability criterion.
SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.
Define subsets of the data D using a binary test f(X, s) to split the data into left and right subsets, D = LS ∪ RS:

LS(s, f, D) = {X ∈ D : f(X, s) = T}
RS(s, f, D) = D − LS(s, f, D)

SSV(s) = 2 Σ_{c∈C} |LS(s, f, D) ∩ D_c| · |RS(s, f, D) ∩ (D − D_c)|
         − Σ_{c∈C} min(|LS(s, f, D) ∩ D_c|, |RS(s, f, D) ∩ D_c|)
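For a univariate threshold test f(X, s): x < s, the criterion can be sketched as follows (a stdlib-only illustration; function names and the toy data are mine):

```python
def ssv(split_value, xs, ys):
    """Separability Split Value for the threshold test f(x, s): x < s.

    The first term rewards pairs from different classes falling on opposite
    sides of the split; the second penalises classes split across both sides.
    """
    left = [y for x, y in zip(xs, ys) if x < split_value]
    right = [y for x, y in zip(xs, ys) if x >= split_value]
    classes = set(ys)
    score = 2 * sum(left.count(c) * (len(right) - right.count(c)) for c in classes)
    score -= sum(min(left.count(c), right.count(c)) for c in classes)
    return score

def best_split(xs, ys):
    """Scan candidate thresholds midway between consecutive feature values."""
    cand = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(cand, cand[1:])]
    return max(thresholds, key=lambda s: ssv(s, xs, ys))
```

On two well-separated clusters the maximum of the criterion falls between them, which is exactly the behaviour the tree relies on.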
SSV – complex tree
Trees may always learn to achieve 100% accuracy.
Very few vectors are left in the leaves – splits are not reliable and will overfit the data!
SSV – simplest tree
Pruning finds the nodes that should be removed to increase generalization – accuracy on unseen data.
Tree with 7 nodes left: 15 errors / 178 vectors.
SSV – logical rules
Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:
1. if proline > 719 and flavanoids > 2.3 then class 1
2. if proline < 719 and OD280 > 2.115 then class 2
3. if proline > 719 and flavanoids < 2.3 then class 3
4. if proline < 719 and OD280 < 2.115 then class 3
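Applied to a sample, the four rules above collapse into a tiny decision function; a sketch, where the dictionary keys are illustrative names for the three features used:

```python
def classify(sample):
    """Apply the four crisp rules extracted from the simplest SSV tree.

    `sample` is a dict with 'proline', 'flavanoids' and 'OD280' keys
    (key names chosen here for illustration).
    """
    if sample['proline'] > 719:
        # Rules 1 and 3: flavanoids decide between class 1 and class 3
        return 1 if sample['flavanoids'] > 2.3 else 3
    # Rules 2 and 4: OD280 decides between class 2 and class 3
    return 2 if sample['OD280'] > 2.115 else 3
```

The whole rule set uses only 3 of the 13 measured features, which is the comprehensibility advantage the slides stress.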
How accurate are such rules? Not 15/178 errors (91.5% accuracy) estimated on the training data!
Run 10-fold CV and average the results: 85±10%? Run it 10 times and average: 85±10±2%? Run again ...
SSV – optimal trees/rules
Optimal: estimate how well rules will generalize.
Use stratified crossvalidation for training; use beam search for better results.
1. if OD280/D315 > 2.505 and proline > 726.5 then class 1
2. if OD280/D315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
3. if OD280/D315 > 2.505 and proline < 726.5 then class 2
4. if OD280/D315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
5. if OD280/D315 < 2.505 and hue < 0.875 then class 3
Note: 6/178 errors, or 96.6% reclassification accuracy! Run 10-fold CV: results are 85±10%? Run it 10 times!
Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions).
s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1, 2]}
large(x) = True{x | x > 2}
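As a sketch, such rectangular (step-function) linguistic variables are nothing more than interval predicates:

```python
def interval_predicate(lo, hi):
    """Crisp linguistic variable: True iff lo <= x <= hi (rectangular m.f.)."""
    return lambda x: lo <= x <= hi

def small(x):
    return x < 1            # True{x | x < 1}

medium = interval_predicate(1, 2)   # True{x | x in [1, 2]}

def large(x):
    return x > 2            # True{x | x > 2}
```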
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
Crisp logic is based on rectangular membership functions:
True/False values jump from 0 to 1.
Step functions are used for partitioning of the feature space.
Very simple hyper-rectangular decision borders.
Expressive power of crisp logical rules is very limited!
Similarity cannot be captured by rules.
Logical rules - advantages
Logical rules, if simple enough, are preferable.
• Rules may expose limitations of black box solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than NN and other CI methods.
• Overfitting is easy to control; rules usually have a small number of parameters.
• Rules forever!? A logical rule about logical rules is:
IF the number of rules is relatively small AND the accuracy is sufficiently high
THEN rules may be an optimal choice.
Logical rules - limitations
Logical rules are preferred, but ...
• Only one class is predicted, p(Ci|X,M) = 0 or 1; such a black-and-white picture may be inappropriate in many applications.
• Discontinuous cost functions allow only non-gradient optimization methods, which are more expensive.
• Sets of rules are unstable: small change in the dataset leads to a large change in structure of sets of rules.
• Reliable crisp rules may reject some cases as unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules remove some limitations, but are not so comprehensible.
From rules to probabilities
Data has been measured with unknown error. Assume a Gaussian distribution:
x → G_x = G(y; x, s_x)

G_x – a fuzzy number with Gaussian membership function.
A set of logical rules R is applied to fuzzy input vectors: Monte Carlo simulations for an arbitrary system => p(Ci|X).
Analytical evaluation of p(C|X) is based on the cumulant function:

σ(a; x) = ∫_{−∞}^{a} G(y; x, s_x) dy = ½ [1 + erf((a − x)/(s_x √2))]

For slope β = 2.4/(s_x √2) the logistic function differs from the error function by less than 0.02.
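This closeness is easy to check numerically; a sketch comparing the exact Gaussian cumulant (via math.erf) with the logistic approximation, using the 2.4/√2 slope from the slide (function names are mine):

```python
import math

def gaussian_cdf(a, x, s):
    """Cumulant of G(y; x, s): probability that the fuzzy input lies below a."""
    return 0.5 * (1.0 + math.erf((a - x) / (s * math.sqrt(2.0))))

def logistic_cdf(a, x, s):
    """Logistic approximation with slope beta = 2.4 / (s * sqrt(2))."""
    beta = 2.4 / (s * math.sqrt(2.0))
    return 1.0 / (1.0 + math.exp(-beta * (a - x)))

# Maximum absolute difference on a grid around the measured value x = 0
diff = max(abs(gaussian_cdf(a / 100.0, 0.0, 1.0) - logistic_cdf(a / 100.0, 0.0, 1.0))
           for a in range(-500, 501))
```

On this grid the difference stays below the 0.02 bound quoted above, which is why logistic neurons can implement the Gaussian-uncertainty interpretation of rules.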
ROC curves
ROC curves display S₊ versus (1 − S₋) for different models (classifiers) or different confidence thresholds.
Ideal classifier: below some threshold S₊ = 1 (all positive cases recognized) for 1 − S₋ = 0 (no false alarms).
Useless classifier (blue): the same number of true positives as false alarms for any threshold.
Reasonable classifier (red): no errors until some threshold that allows for recognition of 0.5 positive cases; no errors if 1 − S₋ > 0.6; slowly rising errors in between.
Good measure of quality: high AUC, Area Under ROC Curve.
AUC = 0.5 is random guessing, AUC = 1 is perfect prediction.
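AUC can be computed directly from its probabilistic meaning, the chance that a random positive scores above a random negative (the Wilcoxon-Mann-Whitney form); a stdlib-only sketch with names of my choosing:

```python
def auc(scores_pos, scores_neg):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as 1/2 (Wilcoxon-Mann-Whitney statistic)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Perfect separation of the two score lists gives AUC = 1, identical score distributions give 0.5, matching the interpretation above.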
Fuzzification of rules
Rule R_a(x) = {x > a} is fulfilled by G_x with probability:

p(R_a | G_x) = ∫_a^∞ G(y; x, s_x) dy ≈ σ(x − a)

The error function is approximated by a logistic function; assuming the error distribution σ(x)(1 − σ(x)), for s² = 1.7 it approximates a Gaussian to within 3.5%.

Rule R_ab(x) = {b > x ≥ a} is fulfilled by G_x with probability:

p(R_ab | G_x) = ∫_a^b G(y; x, s_x) dy ≈ σ(x − a) − σ(x − b)
Soft trapezoids and NN
The difference of two sigmoids makes a soft trapezoidal membership function.
Conclusion: fuzzy logic with soft trapezoidal membership functions σ(x − a) − σ(x − b) is equivalent to crisp logic rules + Gaussian uncertainty of inputs.
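A minimal sketch of such a soft trapezoid as a difference of two logistic sigmoids (the slope value here is arbitrary, chosen only for illustration):

```python
import math

def sigmoid(t):
    """Logistic function, the smooth stand-in for a step function."""
    return 1.0 / (1.0 + math.exp(-t))

def soft_trapezoid(x, a, b, slope=5.0):
    """Membership sigma(slope*(x-a)) - sigma(slope*(x-b)): a smooth version
    of the rectangle [a, b]; slope -> infinity recovers the crisp rule."""
    return sigmoid(slope * (x - a)) - sigmoid(slope * (x - b))
```

Inside the interval the membership is close to 1, far outside it is close to 0, and increasing the slope sharpens the edges toward the crisp rectangular function.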
Optimization of rules
Fuzzy: large receptive fields, rough estimations.
G_x – uncertainty of inputs, small receptive fields.
Minimization of the number of errors is difficult: non-gradient. But now Monte Carlo or analytical p(C|X; M) allows gradient minimization of:

E({X}; R, s_x) = ½ Σ_X Σ_i ( p(C_i | X; M) − δ(C_i, C(X)) )²
• Gradient optimization works for a large number of parameters.
• Parameters s_x are known for some features; use them as optimization parameters for the others!
• Probabilities instead of 0/1 rule outcomes.
• Vectors that were not classified by crisp rules now have non-zero probabilities.
Mushrooms
The Mushroom Guide: no simple rule for mushrooms; no rule like 'leaflets three, let it be' for Poisonous Oak and Ivy.
8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
R1 + R2 are quite stable, found even with 10% of data;
R3 and R4 may be replaced by other rules, e.g.:
R'3: gill-size = narrow ∧ stalk-surface-above-ring ∈ {silky, scaly}
R'4: gill-size = narrow ∧ population = clustered
Only 5 of 22 attributes are used! The simplest possible rules? 100% in CV tests – the structure of this data is completely clear.
Recurrence of breast cancer
Institute of Oncology, University Medical Center, Ljubljana.
286 cases: 201 no-recurrence (70.3%), 85 recurrence cases (29.7%).
Many systems tried, 65-78% accuracy reported. Single rule:
IF nodes-involved ∉ [0, 2] ∧ degree-malignant = 3 THEN recurrence ELSE no-recurrence
77% accuracy, only trivial knowledge in the data: highly malignant cancer involving many nodes is likely to strike back.
Neurofuzzy system
Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of probability density distribution (PDF) using a single-hidden-layer network (RBF-like), with nodes realizing separable functions:
G(X; P) = Π_{i=1..N} G(X_i; P_i)
Fuzzy: the crisp yes/no membership is replaced by a degree μ(x). Triangular, trapezoidal, Gaussian or other membership functions.
Membership functions in many dimensions: products of one-dimensional factors.
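Such a separable node can be sketched as a product of one-dimensional Gaussian factors (a minimal illustration; the function names and parameter layout are mine):

```python
import math

def gaussian(x, center, width):
    """One-dimensional Gaussian membership factor."""
    return math.exp(-((x - center) / width) ** 2)

def fsm_node(x, centers, widths):
    """Separable node value: product of one-dimensional membership factors,
    G(X; P) = prod_i G(X_i; P_i)."""
    value = 1.0
    for xi, c, w in zip(x, centers, widths):
        value *= gaussian(xi, c, w)
    return value
```

Separability is what makes rule extraction possible: each factor constrains one feature independently, so a node reads directly as a conjunction of one-dimensional conditions.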
FSM
Rectangular functions: simple rules are created, many nearly equivalent descriptions of this data exist.
If proline > 929.5 then class 1 (48 cases, 45 correct + 2 recovered by other rules).
If color < 3.79285 then class 2 (63 cases, 60 correct)
Interesting rules, but overall accuracy is only 88±9%
Initialize using clusterization or decision trees.Triangular & Gaussian f. for fuzzy rules.Rectangular functions for crisp rules.
Between 9-14 rules with triangular membership functions are created; accuracy in 10xCV tests about 96±4.5%
Scatterograms for hypothyroid
Shows images of training vectors mapped by the neural network; for more than 2 classes use either linear projections, several 2D scatterograms, or parallel coordinates.
Good for:
• analysis of the learning process;
• comparison of network solutions;
• stability of the network;
• analysis of the effects of regularization;
• evaluation of confidence by perturbation of the query vector.
...
Details: W. Duch, IJCNN 2003
Neural networks
• MLP – Multilayer Perceptrons, the most popular NN models. Use soft hyperplanes for discrimination. Results are difficult to interpret, complex decision borders. Prediction, approximation: infinite number of classes.
• RBF – Radial Basis Functions.
RBF with Gaussian functions are equivalent to fuzzy systems with Gaussian membership functions, but …
No feature selection => complex rules.
Other radial functions => not separable!
Use separable functions, not radial => FSM.
• Many methods to convert MLP NN to logical rules.
Rules from MLPs
Learning dynamics
Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.
Thyroid – some results
Accuracy of diagnoses obtained with several systems – rules are accurate.

Method                    Rules/Features  Training %  Test %
MLP2LN optimized          4/6             99.9        99.36
CART/SSV Decision Trees   3/5             99.8        99.33
Best Backprop MLP         –/21            100         98.5
Naïve Bayes               –/–             97.0        96.1
k-nearest neighbors       –/–             –           93.8
Thyroid – output visualization
2D – plot a scatterogram of the vectors transformed by the network.
3D – display it inside the cube, use perspective.
ND – make linear projection of network outputs on the polygon vertices
Psychometry
Use CI to find knowledge, create an Expert System.
• There is no simple correlation between single values and final diagnosis.
• Results are displayed in the form of a histogram, called 'a psychogram'. Interpretation depends on the experience and skill of an expert and takes into account correlations between peaks.
Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.
Problem: experts agree only about 70% of the time; alternative diagnosis and personality changes over time are important.
Psychometric data
1600 cases for women, the same number for men.
27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...
Extraction of logical rules: 14 scales = features.
Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.
Probabilities for different classes. For greater uncertainties more classes are predicted.
Fitting the rules to the conditions:typically 3-5 conditions per rule, Gaussian distributions around measured values that fall into the rule interval are shown in green.
Verbal interpretation of each case, rule and scale dependent.
Visualization
Probability of classes versus input uncertainty.
Detailed input probabilities around the measured values vs. change in a single scale; changes over time define the 'patient's trajectory'.
Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.
Summary
Computational intelligence methods (neural, decision trees, similarity-based and others) help to understand the data.
Understanding data: achieved by rules, prototypes, visualization.
We are slowly getting there. All this and more is included in GhostMiner, data mining software developed in collaboration with Fujitsu: http://www.fqspl.com.pl/ghostminer/
Small is beautiful => simple is best!
Simplest possible, but not simpler - regularization of models; accurate but not too accurate - handling of uncertainty;
high confidence, but not paranoid - rejecting some cases.
• Challenges:
hierarchical systems, discovery of theories rather than data models, integration with image/signal analysis, reasoning in complex domains/objects, applications in bioinformatics ...