Understanding of data using Computational Intelligence methods
Włodzisław Duch
Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland
http://www.phys.uni.torun.pl/~duch
IEA/AIE Cairns, 17-20.06.2002

What am I going to say
• Data and CI.
• What we hope for.
• Forms of understanding.
• Visualization.
• Prototypes.
• Logical rules.
• Some knowledge discovered.
• Expert system for psychometry.
• Conclusions, or why am I saying this?

Types of Data
• Data was precious! Now it is overwhelming ...
• Statistical data – clean, numerical, controlled experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …

CI & AI definition
• Computational Intelligence is concerned with …
This corresponds to all cognitive processes, including low-level ones (perception).
• Artificial Intelligence is a part of CI concerned with effectively solving non-algorithmic problems that require systematic reasoning and symbolic knowledge representation.
Roughly, this corresponds to high-level cognitive processes.

Turning data into knowledge
What should CI methods do?
• Provide descriptive and predictive non-parametric models of data.
• Make it possible to classify, approximate, associate, correlate and complete patterns.
• Make it possible to discover new categories and interesting patterns.
• Help to visualize multi-dimensional relationships among data samples.
• Make it possible to understand the data in some way.
• Facilitate creation of expert systems (ES) and reasoning.

Forms of useful knowledge
But ... knowledge accessible to humans is in:
• symbols,
• similarity to prototypes,
• images, visual representations.
What type of explanation is satisfactory?
Interesting question for cognitive scientists.
Different answers in different fields.

Data understanding
Types of explanation:
• visualization-based: maps, diagrams, relations ...
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules.
• Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.
• Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do.
• Logical rules are the highest form of summarization of knowledge.

All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure.
Normal and malignant lymphocytes.
3-bit parity + all 5-bit combinations.
Try to preserve all distances in 2D nonlinear mapping.
MDS of large sets using LVQ + relative mapping.
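
How well a 2D map preserves the original distances can be measured with a stress function. The sketch below is only an illustration under stated assumptions: the normalized squared-error form of the stress, the helper names `pairwise` and `stress`, and the toy points are not taken from the slides, and no particular MDS variant is implied.

```python
import math

def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def stress(original, mapped):
    """Normalized sum of squared differences between original and mapped distances
    (an assumed, commonly used form; 0 means all distances are preserved)."""
    D, d = pairwise(original), pairwise(mapped)
    n = len(original)
    num = sum((D[i][j] - d[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))
    den = sum(D[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    return num / den

# Hypothetical data: four points lying in a plane of a 3D space.
points_3d = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
perfect_map = [(x, y) for x, y, _ in points_3d]  # keeps every distance
collapsed_map = [(0, 0)] * 4                     # loses every distance

print(stress(points_3d, perfect_map), stress(points_3d, collapsed_map))
```

A nonlinear mapping procedure would move the 2D points to make this value as small as possible.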

Prototype-based rules
C-rules (crisp) are a special case of F-rules (fuzzy rules).
F-rules (fuzzy rules) are a special case of P-rules (prototype rules).
P-rules have the form:
IF P = arg min_R D(X,R) THEN Class(X) = Class(P)
D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P.
P-rules are easy to interpret!
IF X=You are most similar to the P=Superman THEN You are in the Super-league.
IF X=You are most similar to the P=Weakling THEN You are in the Failed-league.
“Similar” may involve different features or D(X,P).
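
A P-rule of this form can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `p_rule`, the two-feature prototypes and their values are hypothetical, and the distance function is passed in to reflect that "similar" may be defined in different ways.

```python
import math

def p_rule(x, prototypes, distance=math.dist):
    """Return the class of the prototype P = arg min_R D(x, R)."""
    _, label = min(prototypes, key=lambda item: distance(x, item[0]))
    return label

def manhattan(a, b):
    """Alternative dissimilarity: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical prototypes described by two made-up features (strength, speed).
prototypes = [((9.0, 9.0), "Super-league"), ((1.0, 2.0), "Failed-league")]

print(p_rule((8.0, 7.5), prototypes))             # Euclidean distance
print(p_rule((2.0, 1.0), prototypes, manhattan))  # Manhattan distance
```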

P-rules
Euclidean distance leads to Gaussian fuzzy membership functions + product as T-norm:
$$D(X,P) = \sum_i d(X_i,P_i) = \sum_i W_i (X_i - P_i)^2$$
$$\mu_P(X) = e^{-D(X,P)} = \prod_i e^{-d(X_i,P_i)} = \prod_i e^{-W_i (X_i - P_i)^2} = \prod_i \mu_i(X_i;P_i)$$
Manhattan function => $\mu(X;P) = \exp\{-|X-P|\}$
Various distance functions lead to different MF.
Ex. data-dependent distance functions, for symbolic data:
$$D_{VDM}(X,Y) = \sum_i \sum_j \left| p(C_j|X_i) - p(C_j|Y_i) \right|$$
$$D_{PDF}(X,Y) = \sum_i \sum_j \left| p(X_i|C_j) - p(Y_i|C_j) \right|$$
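
A data-dependent distance of the VDM type can be estimated from class counts. The sketch below assumes the form D_VDM(X,Y) = Σ_i Σ_j |p(C_j|X_i) − p(C_j|Y_i)| with probabilities estimated as simple frequencies; the helper names and the four toy samples are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import Counter, defaultdict

def conditional_probs(samples, labels):
    """Estimate p(C_j | X_i = v) for every feature i and symbolic value v."""
    classes = sorted(set(labels))
    counts = [defaultdict(Counter) for _ in samples[0]]
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            counts[i][v][c] += 1
    probs = []
    for feature_counts in counts:
        table = {}
        for v, cnt in feature_counts.items():
            total = sum(cnt.values())
            table[v] = {c: cnt[c] / total for c in classes}
        probs.append(table)
    return classes, probs

def vdm(x, y, classes, probs):
    """Sum over features and classes of |p(C|x_i) - p(C|y_i)|."""
    return sum(abs(probs[i][xi][c] - probs[i][yi][c])
               for i, (xi, yi) in enumerate(zip(x, y))
               for c in classes)

# Hypothetical toy data: the first symbol decides the class, the second is noise.
samples = [("a", "g"), ("a", "t"), ("c", "g"), ("c", "t")]
labels = ["+", "+", "-", "-"]
classes, probs = conditional_probs(samples, labels)

print(vdm(("a", "g"), ("c", "g"), classes, probs))  # differ in the informative symbol
print(vdm(("a", "g"), ("a", "t"), classes, probs))  # differ only in the noise symbol
```

Symbols that predict the class in the same way end up at distance zero, even when they are different symbols.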

Crisp P-rules
New distance functions from info theory => interesting MF.
Membership functions => new distance function, with local D(X,R) for each cluster.
Crisp logic rules: use the $L_\infty$ norm:
$$D(X,P) = \|X-P\|_\infty = \max_i W_i |X_i - P_i|$$
D(X,P) = const => rectangular contours.
$L_\infty$ (Chebyshev) distance with thresholds $\theta_P$:
$$\text{IF } D(X,P) \le \theta_P \text{ THEN } C(X)=C(P)$$
is equivalent to a conjunctive crisp rule
$$\text{IF } X_1 \in [P_1-\theta_P/W_1,\ P_1+\theta_P/W_1] \wedge \dots \wedge X_N \in [P_N-\theta_P/W_N,\ P_N+\theta_P/W_N] \text{ THEN } C(X)=C(P)$$
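
The equivalence between the Chebyshev-distance rule and the conjunctive interval rule can be checked numerically. The sketch below assumes positive weights W_i and a threshold written here as `theta`; the prototype, weights and grid of test points are made up for the check.

```python
def chebyshev(x, p, w):
    """Weighted L-infinity distance: max_i W_i |X_i - P_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def distance_rule(x, p, w, theta):
    """P-rule condition: D(X, P) <= theta."""
    return chebyshev(x, p, w) <= theta

def interval_rule(x, p, w, theta):
    """Crisp rule condition: every X_i lies in [P_i - theta/W_i, P_i + theta/W_i]."""
    return all(pi - theta / wi <= xi <= pi + theta / wi
               for xi, pi, wi in zip(x, p, w))

# Hypothetical prototype, weights and threshold.
p, w, theta = (1.0, 2.0), (2.0, 0.5), 1.0

# Grid of test points with steps 0.25 and 0.5 (exact in binary floating point).
grid = [(i * 0.25, j * 0.5) for i in range(-4, 13) for j in range(-4, 13)]
agree = all(distance_rule(x, p, w, theta) == interval_rule(x, p, w, theta)
            for x in grid)
print(agree)
```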

Complex objects
The vector space concept is not sufficient for complex objects. A common set of features is meaningless.
AI: complex objects, states, subproblems.
General approach: it is sufficient to evaluate similarity D(O_i,O_j).
Many transformations T connecting a pair of objects O_i and O_j exist:
$$O_j = \hat{T}\, O_i = \prod_k \hat{T}_k\, O_i$$
Cost of transformation = sum of the costs of the steps T_k.
Similarity: lowest transformation costs.
Bioinformatics: sophisticated similarity functions for sequences.
Dynamic programming finds similarities in reasonable time.
Use adaptive costs and a general framework for SBM methods.
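
For strings, the lowest-cost chain of elementary transformations can be found with dynamic programming. The sketch below uses the classic edit distance as a stand-in; the three elementary operations and their cost parameters are assumptions made for illustration (in an adaptive scheme they would be tuned), not the specific similarity functions referred to on the slide.

```python
def transformation_cost(a, b, insert=1.0, delete=1.0, substitute=1.0):
    """Lowest total cost of turning sequence a into sequence b using
    insertions, deletions and substitutions (dynamic programming)."""
    prev = [j * insert for j in range(len(b) + 1)]  # cost of building b[:j] from ""
    for i, ca in enumerate(a, start=1):
        curr = [i * delete]                         # cost of erasing a[:i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + delete,                   # drop ca
                curr[j - 1] + insert,               # add cb
                prev[j - 1] + (0.0 if ca == cb else substitute),
            ))
        prev = curr
    return prev[-1]

print(transformation_cost("kitten", "sitting"))
print(transformation_cost("act", "agt", substitute=0.5))
```

Changing the cost parameters changes which objects count as similar, which is where adaptation can enter.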

Promoters
DNA strings, 57 nucleotides, 53 + and 53 - samples, e.g.
tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
Euclidean distance, symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4
PDF distance, symbolic s = a, c, t, g replaced by p(s|+)

Connection of CI with AI
AI/CI division is harmful for science!
GOFAI: operators, state transformations and search techniques are basic tools in AI for solving problems requiring systematic reasoning.
CI methods may provide useful heuristics for AI and define metric relations between states, problems or complex objects.
Example: combinatorial productivity in AI systems and FSM.
Later: decision tree for complex structures.

Electric circuit example
Answering questions in complex domains requires reasoning.
Qualitative behavior of electric circuit:
7 variables, but Ohm's law V = I·R, or Kirchhoff's law V_t = V_1 + V_2
Train a NeuroFuzzy system on Ohm's and Kirchhoff's laws.
Without solving equations, answer questions of the type:
If R_2 grows, R_1 & V_t are constant, what will happen with the current I and voltages V_1, V_2?
(taken from the PDP book, McClelland, Rumelhart, Hinton)

Electric circuit search
AI: create search tree, CI: provide guiding intuition.
Any law of the form A=B*C or A=B+C, ex: V=I*R, has 13 true facts, 14 false facts and may be learned by an NF system.
Geometrical representation: + increasing, - decreasing, 0 constant
Find combinations of Vt, Rt, I, V1, V2, R1, R2 for which all 5 constraints are fulfilled:
$$F(V_t, R_t, I, V_1, V_2, R_1, R_2) = \prod_{i=1}^{5} F_i(A_i, B_i, C_i)$$
This holds for 111 cases out of 3^7 = 2187.
Search and check if X can be +, 0, -; laws are not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0, R2=+) = 0
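
The counts quoted above can be reproduced by brute-force enumeration. The sketch below is a crisp stand-in for the neurofuzzy F: `law_holds` encodes the qualitative truth table of A=B+C (and of A=B*C for positive quantities), and the choice of the five laws (Vt=V1+V2, Rt=R1+R2, V1=I*R1, V2=I*R2, Vt=I*Rt) is an assumption, since the slide names only Ohm's and Kirchhoff's laws.

```python
from itertools import product

SIGNS = ("+", "0", "-")  # increasing, constant, decreasing

def law_holds(a, b, c):
    """Can A change as `a` when B and C change as `b` and `c`
    (A=B+C, or A=B*C with positive quantities)?"""
    if b == c:
        return a == b      # both change the same way (or both are constant)
    if b == "0":
        return a == c      # only C changes
    if c == "0":
        return a == b      # only B changes
    return True            # opposite changes: any outcome is possible

def circuit_ok(Vt, Rt, I, V1, V2, R1, R2):
    """Crisp product of the five assumed constraints."""
    return (law_holds(Vt, V1, V2) and law_holds(Rt, R1, R2) and
            law_holds(V1, I, R1) and law_holds(V2, I, R2) and
            law_holds(Vt, I, Rt))

true_facts = sum(law_holds(*fact) for fact in product(SIGNS, repeat=3))
consistent = [s for s in product(SIGNS, repeat=7) if circuit_ok(*s)]

print(true_facts, 27 - true_facts)       # true and false facts for one law
print(len(consistent), len(SIGNS) ** 7)  # consistent states out of all states
```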

Heuristic search
If R_2 grows, R_1 & V_t are constant, what will happen with the current I and voltages V_1, V_2?
We know that: R_2=+, R_1=0, V_t=0, V_1=?, V_2=?, R_t=?, I=?
Take V_1=+ and check if: F(V_t=0, R_t=?, I=?, V_1=+, V_2=?, R_1=0, R_2=+) > 0
Since F() > 0 for all of V_1 = +, 0 and -, take a variable that leads to a unique answer, R_t.
A single search path solves the problem.
Useful also in approximate reasoning where only some conditions are fulfilled.
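
The single search path can be imitated with simple constraint propagation: repeatedly pick a law with exactly one unknown variable and fix that variable if only one value is allowed. This is a crisp sketch under the same assumptions as before (the truth table in `law_holds` and the particular five laws are not spelled out on the slides), not the neurofuzzy procedure itself.

```python
SIGNS = ("+", "0", "-")

def law_holds(a, b, c):
    """Qualitative truth table for A=B+C (or A=B*C with positive quantities)."""
    if b == c:
        return a == b
    if b == "0":
        return a == c
    if c == "0":
        return a == b
    return True

# Assumed laws, each written as (A, B, C).
LAWS = [("Vt", "V1", "V2"), ("Rt", "R1", "R2"),
        ("V1", "I", "R1"), ("V2", "I", "R2"), ("Vt", "I", "Rt")]

def propagate(known):
    """Fix every variable that some law determines uniquely."""
    state = dict(known)
    changed = True
    while changed:
        changed = False
        for law in LAWS:
            unknown = [v for v in law if v not in state]
            if len(unknown) != 1:
                continue
            var = unknown[0]
            options = [s for s in SIGNS
                       if law_holds(*({**state, var: s}[v] for v in law))]
            if len(options) == 1:
                state[var] = options[0]
                changed = True
    return state

answer = propagate({"R2": "+", "R1": "0", "Vt": "0"})
print(answer)
```

In this crisp version R_t is the first variable fixed, after which I, V_1 and V_2 follow without backtracking.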

Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions).
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
Decision trees lead to specific decision borders.
SSV tree on Wine data, proline + flavanoids content.
Decision tree forests: many decision trees of similar accuracy, but different selectivity and specificity.

Logical rules, if simple enough, are preferable.
• Rules may expose limitations of black box solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than NN and other CI methods.
• Overfitting is easy to control, rules usually have a small number of parameters.
• Rules forever !?
A logical rule about logical rules is:
IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

Logical rules are preferred but ...
• Only one class is predicted, p(C_i|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications.
• Discontinuous cost functions allow only non-gradient optimization.
• Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
• Reliable crisp rules may reject some cases as unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules are not so comprehensible.
p is a hit; p false alarm; p is a miss.

Neural networks and rules
[Figure: neural network with inputs Sex, Age, Smoking, ECG: ST Elevation, Pain Intensity and Pain Duration, connected through input weights and output weights to the output Myocardial Infarction ~ p(MI|X), shown with the value 0.7.]

Knowledge from networks
Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!
• Regularization: mathematical technique improving predictive abilities of the network.
• Result: MLP2LN neural networks that are equivalent to logical rules.

MLP2LN
Converts MLP neural networks into a network performing logical operations (LN).
Input layer
Linguistic units: windows, filters
Aggregation: better features
Output: one node per class.

Learning dynamics
Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability density distribution (PDF) using a single hidden layer network (RBF-like) with nodes realizing separable functions:
$$G(X;P) = \prod_{i=1} G_i(X_i;P_i)$$
Fuzzy: crisp membership (no/yes) is replaced by a degree of membership. Triangular, trapezoidal, Gaussian ... MF.
M.f-s in many dimensions:
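
A separable node of this kind is simply a product of one-dimensional membership functions. The sketch below uses Gaussian factors as one possible choice; the function names, centers and widths are hypothetical and the code is not taken from FSM itself.

```python
import math

def gaussian_mf(x, center, width):
    """One-dimensional Gaussian membership function G_i(X_i; P_i)."""
    return math.exp(-((x - center) / width) ** 2)

def separable_node(x, centers, widths):
    """G(X; P) as the product of one-dimensional factors."""
    return math.prod(gaussian_mf(xi, ci, wi)
                     for xi, ci, wi in zip(x, centers, widths))

# Hypothetical node parameters.
centers, widths = (1.0, -2.0), (0.5, 2.0)

print(separable_node(centers, centers, widths))     # response at the node center
print(separable_node((1.5, 0.0), centers, widths))  # response away from the center
```

Because the factors are independent, each one can be read as a fuzzy condition on a single feature and the product acts as the conjunction.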

Heterogeneous systems
Homogeneous systems: one type of “building blocks”, same type of decision borders.
Committees combine many models together, but lead to complex models that are difficult to understand.
Discovering simplest class structures, its inductive bias: requires heterogeneous adaptive systems (HAS).
Ockham razor: simpler systems are better.
HAS examples:
• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.

GhostMiner, data mining tools from our lab.
• There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
• Separate the process of model building and knowledge discovery from model use =>

Breast cancer diagnosis.
Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
699 cases, 9 cell features quantized from 1 to 10:
clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses.
Task: distinguish benign from malignant cases.

Breast cancer rules.
Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
Simplest rule from MLP2LN, large regularization:
IF uniformity of cell size < 3 THEN benign ELSE malignant
Sensitivity=0.97, Specificity=0.85
More complex solutions (3 rules) give in 10CV: Sensitivity=0.95, Specificity=0.96, Accuracy=0.96
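
Sensitivity and specificity of a single-threshold rule can be computed as below. Only the rule itself comes from the slide; the six labelled cases are made up for illustration and are not Wisconsin data, and malignant is treated as the positive class.

```python
def rule(cell_size_uniformity):
    """Simplest rule: uniformity of cell size < 3 means benign."""
    return "benign" if cell_size_uniformity < 3 else "malignant"

def sensitivity_specificity(cases):
    """cases: list of (uniformity of cell size, true label) pairs."""
    tp = fn = tn = fp = 0
    for value, truth in cases:
        predicted = rule(value)
        if truth == "malignant":
            tp += predicted == "malignant"
            fn += predicted == "benign"
        else:
            tn += predicted == "benign"
            fp += predicted == "malignant"
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical cases, not taken from the real dataset.
toy_cases = [(1, "benign"), (2, "benign"), (4, "benign"),
             (5, "malignant"), (8, "malignant"), (2, "malignant")]

sensitivity, specificity = sensitivity_specificity(toy_cases)
print(sensitivity, specificity)
```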

Breast cancer comparison.

SSV HAS Wisconsin
Heterogeneous decision tree that searches not only for logical rules but also for prototype-based rules.
Single P-rule gives the simplest known description of this data:
IF ||X-R_303|| < 20.27 THEN malignant ELSE benign
18 errors, 97.4% accuracy. Good prototype for malignant!
Simple thresholds, that's what MDs like the most!
Best L1O accuracy 98.3% (FSM),
best 10CV around 97.5% (Naïve Bayes + kernel, SVM)
C4.5 gives 94.7±2.0%
SSV without distances: 96.4±2.1%
Several simple rules of similar accuracy in CV tests exist.
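
The single P-rule and its quoted accuracy can be written out directly. In the sketch below the threshold 20.27 and the counts (18 errors, 699 cases) come from the slides, while the prototype vector standing in for case R_303 and the two test vectors are hypothetical, because the actual feature values are not given here.

```python
import math

def p_rule(x, prototype, threshold=20.27):
    """IF ||X - R|| < threshold THEN malignant ELSE benign."""
    return "malignant" if math.dist(x, prototype) < threshold else "benign"

# Hypothetical stand-in for the prototype case; each feature is quantized 1..10.
prototype = (10,) * 9

print(p_rule((8,) * 9, prototype))  # distance 6 from the stand-in prototype
print(p_rule((1,) * 9, prototype))  # distance 27 from the stand-in prototype

# Accuracy implied by 18 errors on 699 cases.
accuracy = 1 - 18 / 699
print(round(100 * accuracy, 1))
```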

Melanoma skin cancer
Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
Four types of melanoma: benign, blue, suspicious, or malignant.
250 cases, with almost equal class distribution.
Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
TDS (Total Dermatoscopy Score) - single index
Goal: hardware scanner for preliminary diagnosis.

Psychometry
Printed forms are scanned or a computerized version of the test is used.
• Raw data: 550 questions, ex: I am getting tired quickly: Yes - Don’t know - No
• Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.
• Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.
• There is no simple correlation between single values and final diagnosis.
• Results are displayed in the form of a histogram, called ‘a psychogram’. Interpretation depends on the experience and skill of an expert and takes into account correlations between peaks.
Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.
Problem: agreement between experts only 70% of the time; alternative diagnosis and personality changes over time are important.

Psychometric data
1600 cases for women, same number for men.
27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...
Extraction of logical rules: 14 scales = features.
Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

Psychometric data
10-CV for FSM is 82-85%, for C4.5 is 79-84%. Input uncertainty +G_x around 1.5% (best ROC) improves FSM results to 90-92%.

Method   Data   N. rules   Accuracy   +G_x %
C4.5     ♀      55         93.0       93.7
C4.5     ♂      61         92.5       93.1
FSM      ♀      69         95.4       97.6
FSM      ♂      98         95.9       96.9

Psychometric Expert
Probabilities for different classes. For greater uncertainties more classes are predicted.
Fitting the rules to the conditions: typically 3-5 conditions per rule; Gaussian distributions around measured values that fall into the rule interval are shown in green.
Verbal interpretation of each case, rule and scale dependent.
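
The effect of Gaussian input uncertainty on a crisp rule can be sketched as follows: each condition "scale value in [low, high]" is replaced by the probability mass of a Gaussian, centered at the measured value, that falls inside the interval. Treating the conditions as independent and multiplying their probabilities is an assumption of this sketch, and the scale values, widths and intervals are made up.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def condition_probability(x, sigma, low, high):
    """Probability that a Gaussian around measured x falls into [low, high]."""
    return normal_cdf((high - x) / sigma) - normal_cdf((low - x) / sigma)

def rule_probability(values, sigmas, intervals):
    """Product over conditions, assuming independent uncertainties."""
    return math.prod(condition_probability(x, s, lo, hi)
                     for x, s, (lo, hi) in zip(values, sigmas, intervals))

# Hypothetical rule with two conditions on two scales.
intervals = [(60.0, 200.0), (40.0, 60.0)]
print(rule_probability((60.0, 50.0), (1.5, 2.0), intervals))  # first value on a rule border
print(rule_probability((70.0, 50.0), (1.5, 2.0), intervals))  # both values well inside
```

A value sitting exactly on an interval border satisfies the condition with probability one half instead of flipping abruptly between 0 and 1.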

Visualization
Probability of classes versus input uncertainty.
Detailed input probabilities around the measured values vs. change in the single scale; changes over time define ‘patient's trajectory’.
Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

Conclusions
Data understanding is a challenging problem.
• Classification rules are frequently only the first step and may not be the best solution.
• Visualization is always helpful.
• P-rules may be competitive if complex decision borders are required, providing different types of rules.
• Understanding of complex objects is possible, although difficult, using adaptive costs and distance as least expensive transformations (action principles in physics).
• Great applications are coming!

Challenges
• Discovery of theories rather than data models
• Integration with image/signal analysis
• Integration with reasoning in complex domains
• Combining expert systems with neural networks
…
Fully automatic universal data analysis systems: press the button and wait for the truth …
We are slowly getting there.
More & more computational intelligence tools (including our own) are available.