INTRODUCTION TO SYMBOLIC DATA ANALYSIS E. Diday CEREMADE. Paris–Dauphine University TUTORIAL: 13 June 2014 Activity Center, Academia Sinica, Taipei, Taiwan
Page 1:

INTRODUCTION TO SYMBOLIC DATA ANALYSIS

E. Diday

CEREMADE, Paris–Dauphine University

TUTORIAL: 13 June 2014, Activity Center, Academia Sinica, Taipei, Taiwan

Page 2:

OUTLINE

PART 1: BUILDING SYMBOLIC DATA FROM STANDARD OR COMPLEX DATA

PART 2: SYMBOLIC DATA ANALYSIS. Is Symbolic Data Analysis a new paradigm?

PART 3: OPEN DIRECTIONS OF RESEARCH

PART 4: SDA SOFTWARE: SODAS, SYR and R

PART 5: INDUSTRIAL APPLICATIONS

Page 3:

PART 1

BUILDING SYMBOLIC DATA FROM STANDARD OR COMPLEX DATA

Page 4:

What is a standard data table?

It is a set of individuals (i.e. observations) described by a set of numerical variables (such as age, weight, ...) or categorical variables (such as nationality, club name, ...).

Example: a table whose rows are the individuals (Player 1, ..., Messi, Ronaldo, ..., Player n) and whose columns are the variables age, height, weight, nationality, club and team.

Page 5:

What are complex data?

Any data that cannot be considered as a “standard observations x standard variables” data table.

Example: the individuals are towers of nuclear power plants, described by:

• Table 1) Observations: cracks. Variables: crack descriptions.

• Table 2) Observations: corrosions. Variables: corrosion descriptions.

• Table 3) Observations: vertices of a grid. Variables: gap depression from the ground.

Page 6:

Why consider classes of individuals as new individuals?

Example: if we wish to know what makes a player win, we are interested in a standard data table where the individuals are the players (in rows), described (in columns) by their standard characteristic variables.

If we now wish to know what makes a team win, we are interested in a data table where the teams (in rows) are described by characteristic variables of the teams that take into account the variability of the players inside each team.

The teams can now be considered as new, higher-level individuals described by symbolic variables that take into account the variability of the individuals inside each class.

Page 7:

From standard data tables to symbolic data tables

[Figure: on the left, a standard data table describing football players (individuals ind1, ..., indi, ..., indn; variables X1, ..., Xj), with a number (age) or a category (nationality) in each cell Xij. On the right, a symbolic data table describing teams, i.e. classes of individuals (C1, ..., Ci, ..., Ck; variables X’1, ..., X’j), with a symbolic data in each cell: a weight interval, an age bar chart (e.g. the bar chart of the ages in Messi's team), a nationalities bar chart. Some columns are contingency tables.]

Page 8:

SYMBOLIC DATA EXPRESS VARIABILITY INSIDE CLASSES OF INDIVIDUALS

TEAM OF THE MONDIAL    WEIGHT      NATIONALITY        NB OF GOALS
BARSA                  [75, 89]    {French}           {0.8 (0), 0.2 (1)}
MANCHESTER             [80, 95]    {Fr, Alg, Arg}     {0.1 (0), 0.3 (1), …}
PARIS-ST G.            [76, 95]    {Fr, Tun}          {0.4 (0), 0.2 (1), …}
DORTMUND               [70, 85]    {Fr, Engl, Arg}    {0.2 (0), 0.5 (1), …}

Here the variation (of weight, nationality, …) concerns the players of each team.

Therefore each cell can contain:

A number, an interval, a sequence of categorical values, a sequence of weighted values as a barchart, a distribution, …

THESE NEW KINDS OF VARIABLES ARE CALLED « SYMBOLIC » BECAUSE THEY ARE NOT PURELY NUMERICAL, IN ORDER TO EXPRESS THE INTERNAL VARIATION INSIDE EACH CLASS.
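As a minimal sketch of how such cells could be produced, the following Python code aggregates a small table of individuals into one symbolic description per class. The player records, team names and the function name `symbolic_table` are hypothetical and only serve the illustration; the slides do not prescribe any implementation.

```python
from collections import Counter, defaultdict

# Hypothetical individuals: (team, weight in kg, nationality, goals scored).
PLAYERS = [
    ("Team A", 75, "Fr", 0),
    ("Team A", 82, "Fr", 0),
    ("Team A", 89, "Arg", 1),
    ("Team A", 80, "Fr", 0),
    ("Team B", 80, "Fr", 1),
    ("Team B", 95, "Alg", 2),
]


def symbolic_table(players):
    """Aggregate individuals into one symbolic description per class (team)."""
    by_team = defaultdict(list)
    for team, weight, nationality, goals in players:
        by_team[team].append((weight, nationality, goals))

    table = {}
    for team, rows in by_team.items():
        weights = [w for w, _, _ in rows]
        goal_counts = Counter(g for _, _, g in rows)
        table[team] = {
            # Interval-valued cell: [min, max] of the weights in the class.
            "weight": (min(weights), max(weights)),
            # Set-valued cell: the categories observed in the class.
            "nationality": sorted({n for _, n, _ in rows}),
            # Bar-chart-valued cell: relative frequency of each number of goals.
            "goals": {g: c / len(rows) for g, c in sorted(goal_counts.items())},
        }
    return table


TABLE = symbolic_table(PLAYERS)
```

Each row of `TABLE` is a class and each cell holds an interval, a set of categories or a bar chart, mirroring the three kinds of columns in the table above.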

Page 9:

What is the current failure that produced the SDA paradigm?

The failure is that, in current practice, only the “individual” kind of observation is considered.

Therefore these individual observations are described only by standard numerical and categorical variables.

Page 10:

The SDA paradigm shift

It is the transition from “individual observations”, described by standard variables with numerical or categorical values,

to “classes of individuals” (considered as “higher-level observations”), described by “symbolic variables” with “symbolic values” (intervals, probability distributions, sets of categories or numbers, random variables, …) that take into account the variability inside the classes.

“Symbolic values” cannot be treated as numbers.

Page 11:

Building symbolic data needs three steps

First step: we have a standard data table, Table 1, where individuals are described by numerical or categorical random variables Yj.

Second step: we have a Table 2, where classes of individuals are described by random variables Y’j whose values are random variables Yij.

Third step: we have a symbolic data table, Table 3, where the random variables Yij are represented by:

• Probability distributions, histograms, bar charts, percentiles, …

• Intervals: [Min, Max], interquartile interval, etc.

• Sets of numbers or categories

• Functions, such as time series.

Page 12:

VARIABLES

Standard variable values:

• numerical (income, profit, …),

• categorical (countries, stock-exchange places, …)

Symbolic variable values:

• interval,

• bar chart,

• histogram, etc.

Page 13:

Ten examples of symbolic variables

Page 14:

What kind of questions and how are they structured?

Page 15:

How to build symbolic data from standard or complex data?

How should the numerical, ordinal and nominal ground variables be categorized in order to obtain the symbolic histogram or bar chart variables of each class?

First: find the discretisation which discriminates these classes as well as possible.

Second, or simultaneously: maximize the correlation between the bins.
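The first criterion can be sketched as a small search over cut points. The code below is only an illustration under stated assumptions: it uses a single cut (two bins), measures discrimination by the total variation distance between the class bar charts, and works on invented ages. The score, the function names and the data are all hypothetical, and the second criterion (correlation between bins) is not covered.

```python
from itertools import combinations


def bar_chart(values, cut):
    """Two-bin bar chart (relative frequencies) for the bins <= cut and > cut."""
    low = sum(1 for v in values if v <= cut) / len(values)
    return (low, 1.0 - low)


def discrimination(classes, cut):
    """Sum over pairs of classes of the total variation distance between bar charts."""
    charts = [bar_chart(vals, cut) for vals in classes.values()]
    return sum(
        0.5 * sum(abs(p - q) for p, q in zip(a, b))
        for a, b in combinations(charts, 2)
    )


def best_cut(classes):
    """Try the midpoints between consecutive distinct observed values."""
    points = sorted({v for vals in classes.values() for v in vals})
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(candidates, key=lambda c: discrimination(classes, c))


# Hypothetical ages of the players of two teams.
AGES = {"Team A": [19, 20, 21, 22, 30], "Team B": [24, 29, 31, 33, 35]}
CUT = best_cut(AGES)
```

With these invented ages the retained cut separates the mostly young team from the mostly older one; a real application would search over several cuts and several variables at once.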

Page 16:

SOME ADVANTAGES of SYMBOLIC DATA:

• Work at the needed level of generality without losing variability.

• Reduce simple or complex huge data.

• Reduce the number of observations and the number of variables.

• Reduce missing data.

• Ability to extract simplified knowledge and decisions from complex data.

• Solve confidentiality issues (classes are not confidential in the way individuals are).

• Facilitate the interpretation of results: decision trees, factorial analysis, new kinds of graphics.

• Extend Data Mining and Statistics to new kinds of data with many industrial applications.

Page 17:

PART 2

SYMBOLIC DATA ANALYSIS

Page 18:

SYMBOLIC DATA ANALYSIS TOOLS HAVE BEEN DEVELOPED

- Graphical visualisation of symbolic data

- Correlation, mean, mean square, distribution of a symbolic variable

- Dissimilarities between symbolic descriptions

- Clustering of symbolic descriptions

- S-Kohonen Mappings

- S-Decision Trees

- S-Principal Component Analysis

- S-Discriminant Factorial Analysis

- S-Regression

- Etc.

Page 19:

From standard observations to classes, the correlation is not the same!

• The observations are uniformly distributed in the circle:

• there is no correlation between Y1 and Y2 for the initial observations.

• A correlation appears between the two variables for the centers of a given partition into 4 classes.

[Figure: a circle of observations in the (Y1, Y2) plane; the four class centers are marked with an x.]
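The effect can be reproduced with a short simulation. This is a sketch under assumptions: the points are drawn uniformly in the unit disc, and the partition into 4 classes is obtained by cutting along Y1 + Y2 into four groups of equal size. The slide does not say which partition is used, so this choice is hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
points = []
while len(points) < 20000:
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
    if x * x + y * y <= 1:  # rejection sampling: keep points inside the unit disc
        points.append((x, y))

# Assumed partition: four classes of equal size, cut along Y1 + Y2.
points.sort(key=lambda p: p[0] + p[1])
size = len(points) // 4
classes = [points[i * size:(i + 1) * size] for i in range(4)]
centers = [
    (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
    for c in classes
]

r_individuals = pearson([x for x, _ in points], [y for _, y in points])
r_centers = pearson([x for x, _ in centers], [y for _, y in centers])
```

Here `r_individuals` stays close to zero while `r_centers` is close to one, because the four centers line up along the diagonal. A different partition (for instance by quadrant) would give a different correlation between centers, which is the point of the slide: the result depends on the classes.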

Page 20:

WHY CAN SYMBOLIC DATA NOT BE REDUCED TO A CLASSICAL STANDARD DATA TABLE?

Symbolic data table:

Players category    Weight      Size            Nationality
Very good           [80, 95]    [1.70, 1.95]    {0.7 Eur, 0.3 Afr}

Transformation into classical data:

Players category    Weight Min    Weight Max    Size Min    Size Max    Eur    Afr
Very good           80            95            1.70        1.95        0.7    0.3

Concern: the initial variables are lost and the variation is lost!

Page 21:

Divisive clustering or decision tree

[Figure: two trees side by side. Symbolic analysis: the tree uses the variable Weight. Classical analysis: the tree uses the variable Max Weight.]

Page 22:

PCA and NETWORK OF BAR CHART DATA of 30 Fisher Iris data clusters*

* SYROKKO Company [email protected]

Any symbolic variable (set of bins variables) can be projected. Here the species variable.

Page 23:

The contributions of the symbolic variables lie inside the smallest hypercube containing the correlation sphere of the bins.

Page 24:

Numerical versus symbolic space of representation

(Y1(Ci), Y2(Ci)) = ([a1i, b1i], [a2i, b2i])

[Figure: two panels for the classes C1, ..., Ci, ..., Ck described by the interval variables Y1 and Y2. Numerical representation of interval variables: each class Ci is a point with coordinates a1i, b1i, a2i, b2i on the axes a1, b1, a2, b2. Bi-plot of interval variables: each class Ci is the rectangle [a1i, b1i] x [a2i, b2i] in the (Y1, Y2) plane.]

Page 25:

Bi-plot of histogram variables

• The joint probability can be inferred by a copula model.

[Figure: classes C1, ..., Ci, ..., Ck described by the histogram variables Y1 and Y2; the joint distribution in the (Y1, Y2) plane is obtained through a copula.]

Page 26:

PART 3: OPEN DIRECTIONS OF RESEARCH

• Models of models.

• Laws of parameters of laws.

• Laws of vectors of laws.

• Copulas needed.

• Four general convergence theorems.

• Optimisation in unsupervised learning (hierarchical and pyramidal clustering).

Page 27:

From the lower level of individual observations to the higher level of class observations: higher-level models are needed

[Figure: Table 1 has the individuals (ind1, ..., Messi, ..., indn) in rows and the variables X1, ..., Xj in columns; a cell Xij contains a number (the age of Messi). Table 2 has the teams (C1, ..., Ci, ..., Ck) in rows and the variables X’1, ..., X’j in columns; a cell contains a symbolic data (the ages in Messi's team).]

Xj is a standard random numerical variable.

X’j is a random variable with histogram values.

Question: if the law of Xj is given, what is the law of X’j? (Dirichlet models are useful.)

Page 28:

Why use copula models in Symbolic Data Analysis?

f(i, j, j’) is the joint probability of the variables j and j’ for the individual i.

In case of independence, we have f(i, j, j’) = f(i, j) . f(i, j’).

If there is dependence: f(i, j, j’) = Copula(f(i, j), f(i, j’)).

Aim of the copula model in SDA: find the copula which minimises the difference with the joint probability, in order to avoid the restriction to independence hypotheses and to reduce the cost of computing f(i, j, j’).
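For reference, the standard result behind the copula formula is Sklar's theorem, stated here for distribution functions. Reading the slide's f(i, j) as the marginal distribution function F_j of variable j for individual i, and f(i, j, j’) as the joint distribution function H, is an assumption, since the slide only speaks of probabilities. The symbols H, F_j, F_{j'} and C are introduced for this illustration.

```latex
% Sklar's theorem: any joint distribution function H with marginals F_j, F_{j'}
% can be written through a copula C on the unit square.
H(y_j, y_{j'}) = C\bigl(F_j(y_j),\, F_{j'}(y_{j'})\bigr),
\qquad C : [0,1]^2 \to [0,1].

% Independence is the special case of the product copula:
C_{\mathrm{indep}}(u, v) = u\,v
\quad\Longrightarrow\quad
H(y_j, y_{j'}) = F_j(y_j)\,F_{j'}(y_{j'}).
```

The independence line of the slide is thus the special case of the product copula, and fitting a copula amounts to choosing C so that the right-hand side is as close as possible to the observed joint.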

Page 29:

FOUR THEOREMS TO BE PROVED FOR ANY METHOD EXTENDED TO SYMBOLIC DATA

M(n, k) is supposed to be an SDA method, where k is the number of classes obtained on n initial individuals.

THEOREM 1: If the k classes are fixed and n tends towards infinity, then M(n, k) converges towards a stable position.

THEOREM 2: If k increases until there is a single individual per class, then M(n, k) converges towards a standard method.

THEOREM 3: If k and n increase simultaneously towards infinity, then M(n, k) converges towards a stable position.

THEOREM 4: If the k laws associated with the k classes are considered as a sample of a law of laws, then M(n, k) applied to this sample converges to M(n, k) applied to this law.

Examples:

Theorem 1: it was proved in Diday, Emilion (CRAS, Choquet 1998) for Galois lattices: as the size of the population grows, the classes (described by vectors of distributions) organise themselves into a Galois lattice that converges. Emilion (CRAS, 2002) also gives a theorem for mixtures of laws of laws, using martingales and a Dirichlet model.

Theorem 2: for example, the classical PCA MO is a special case of the PCA denoted M(n, k) built on vectors of intervals.

Theorem 3: this is the setting of data that arrive sequentially (of “Data Stream” type) and of one-pass algorithms (see e.g. Diday, Murty (2005)).

Theorem 4: in the case of a hierarchical or pyramidal clustering (2D, 3D, etc.), convergence means that the large levels and their structure stabilise. In the case of a PCA, convergence means that the factorial axes stabilise.

Page 30:

Optimisation in clustering

Each class is described by symbolic data.

[Figure: a hierarchy and a pyramid built on x1, x2, x3, x4, x5, and a 3D spatial pyramid.]

d is the given dissimilarity.

Hierarchies: W = |d - U|, where U is an ultrametric dissimilarity.

Pyramids: W = |d - R|, where R is a Robinsonian dissimilarity.

Spatial pyramids: W = |d - Y|, where Y is a Yadidean dissimilarity.
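For the hierarchical case, the criterion can be illustrated on a toy matrix. The sketch below builds one particular ultrametric U from d, the subdominant ultrametric (the one produced by single linkage), and evaluates W as the sum of |d - U| over pairs. Both the choice of U and the reading of |d - U| as a sum of absolute differences are assumptions, and this U is not claimed to be the minimiser of W. The matrix is invented.

```python
def subdominant_ultrametric(d):
    """Largest ultrametric below d: U[i][j] is the minimax path cost from i to j."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def is_ultrametric(u):
    """Check u(i, j) <= max(u(i, k), u(k, j)) for all triples."""
    n = len(u)
    return all(
        u[i][j] <= max(u[i][k], u[k][j])
        for i in range(n) for j in range(n) for k in range(n)
    )


def criterion(d, u):
    """W as the sum over pairs i < j of |d(i, j) - U(i, j)| (assumed reading)."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))


# Hypothetical dissimilarity matrix on four objects.
D = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
U = subdominant_ultrametric(D)
W = criterion(D, U)
```

An optimisation procedure in the sense of the slide would search over ultrametrics (or Robinsonian or Yadidean dissimilarities) to make W as small as possible; the code only evaluates W for one candidate.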

Page 31:

PART 4: SDA SOFTWARE: SODAS, RSDA, SYR

Page 32:

Software

Different software packages exist today to build symbolic data from standard or complex data and to analyze symbolic data.

SODAS: free academic package, though registration is required and a code is needed for installation, http://www.info.fundp.ac.be/asso/sodaslink.htm

Many symbolic databases can be found at http://www.ceremade.dauphine.fr/SODAS/

RSDA: free academic packages are available on CRAN: [email protected]

SYR: professional package, see: [email protected]

Page 33:

SODAS SOFTWARE

The objective of SCLUST is the clustering of symbolic objects (SOs) by a dynamic algorithm based on symbolic data tables. The aim is to build a partition of the SOs into a predefined number of classes. Each class has a prototype in the form of an SO. The optimality criterion is based on the sum of the proximities between the individuals and the prototypes of the clusters.
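The alternation between allocation and representation described for SCLUST can be sketched on interval data. This is not the SODAS implementation: the city-block distance between interval bounds, the median-based prototype, the naive initialisation and the data are all assumptions made for the illustration.

```python
from statistics import median


def distance(a, b):
    """City-block distance between two intervals given as (lower, upper)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def prototype(members):
    """Interval prototype: median of the lower bounds, median of the upper bounds."""
    return (median(lo for lo, _ in members), median(hi for _, hi in members))


def dynamic_clustering(intervals, k, max_iter=100):
    prototypes = list(intervals[:k])  # naive initialisation: the first k intervals
    assignment = None
    for _ in range(max_iter):
        # Allocation step: each interval goes to its closest prototype.
        new_assignment = [
            min(range(k), key=lambda c: distance(x, prototypes[c]))
            for x in intervals
        ]
        if new_assignment == assignment:
            break  # the partition is stable
        assignment = new_assignment
        # Representation step: recompute the prototype of each non-empty class.
        for c in range(k):
            members = [x for x, a in zip(intervals, assignment) if a == c]
            if members:
                prototypes[c] = prototype(members)
    # Criterion: sum of the distances between the objects and their prototypes.
    total = sum(distance(x, prototypes[a]) for x, a in zip(intervals, assignment))
    return assignment, prototypes, total


# Hypothetical weight intervals (kg) describing six symbolic objects.
INTERVALS = [(70, 85), (90, 105), (72, 84), (88, 104), (71, 86), (92, 108)]
ASSIGNMENT, PROTOTYPES, CRITERION = dynamic_clustering(INTERVALS, k=2)
```

The number of classes is fixed in advance, each class is summarised by a prototype of the same nature as the objects (here an interval), and the value returned last is the sum of the distances to the prototypes, in the spirit of the criterion described above.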

Decision tree on histogram-valued or interval-valued variables

FACTORIAL ANALYSIS: PCA of interval-valued variables

Clustering pyramid

KOHONEN MAP OF CONCEPTS

Superposition of two stars associated with two classes of the pyramid

Page 34:

FROM DATA BASE TO SYMBOLIC DATA IN SODAS

[Diagram: a QUERY on a relational data base returns the individuals, the description of the individuals and their classes. The class descriptions form a symbolic data table: the rows are the classes, the columns are symbolic variables, and the cells contain symbolic data.]

Page 35:

SYR SOFTWARE

Produce a symbolic data table from complex data.

Manage symbolic data tables: sort rows and columns by discriminant power.

Analyse symbolic data tables: SPCA, Sclustering, …

Produce networks, rules and decision trees.

Page 36:

SYR: SYMBOLIC DATA TABLE MANAGEMENT

* SYROKKO Company [email protected]

SYMBOLIC DATA TABLE

Sorting rows by the min or max of intervals, or by the frequencies of bar charts, is possible.

Sorting variables by their power to discriminate the concepts is also possible.

Page 37:

PART 5: INDUSTRIAL APPLICATIONS

Page 38:

Time series data table: anomaly detection on a bridge. LCPC (Laboratoire Central des Ponts et Chaussées) and SNCF data.

Columns: Sensor 1, Sensor 2, Sensor 3, …, Sensor N. Rows: trains.

Each row represents a train crossing the bridge at a given temperature; each cell contains up to 800,000 values.

Each cell is transformed into a HISTOGRAM from a PROJECTION or from WAVELETS.
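As a simplified sketch of the cell transformation, the code below turns one series of sensor values into a histogram-valued cell with equal-width bins. It works directly on the raw values; the projection or wavelet step mentioned on the slide is not reproduced. The signal, the bin range and the function name `to_histogram` are hypothetical.

```python
def to_histogram(values, n_bins, low, high):
    """Relative frequencies of `values` over n_bins equal-width bins on [low, high].

    Values are assumed to lie inside [low, high].
    """
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        # The upper bound `high` is counted in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]


# Hypothetical signal recorded by one sensor while one train crosses the bridge.
SIGNAL = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5, 0.45, 1.0]
HISTOGRAM = to_histogram(SIGNAL, n_bins=4, low=0.0, high=1.0)
```

Applied to every (train, sensor) cell, this replaces up to several hundred thousand values by a handful of frequencies, which is what makes the table usable as a symbolic data table.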

Page 39:

HIERARCHICAL DATA*

Symbolic procedure: from the numerical description of pigs to the symbolic description of farms

Initial table: 125 farms x 30 animals, 19 variables. Description of pig respiratory diseases.

Symbolic table: 125 farms, 64 variables. Description of pig respiratory diseases: median score (continuous variables), animal frequencies (categorical variables).

• Numerical variables and

• categorical variables are transformed into bar charts of the frequencies based on 30 animals, or into interval-valued variables.

*C. Fablet, S. Bougeard (AFSSA)

Page 40:

Step 1: Symbolic Description of Farms*

* SYROKKO Company [email protected]

Page 41:

Nuclear Power Plant

Find correlations between 3 standard data tables with different observation units and different variables

Page 42:

NUCLEAR POWER PLANT (nuclear thermal power station)

Cartography of the tower by a grid

Inspection: cracks, inspection machine

PROBLEM: FIND CORRELATIONS BETWEEN 3 CLASSICAL DATA TABLES OF DIFFERENT UNITS AND VARIABLES:

Table 1) Observations: cracks. Variables: crack descriptions.

Table 2) Observations: vertices of a grid. Variables: gap deviation at different periods compared to the initial model position.

Table 3) Observations: vertices of a grid. Variables: gap depression from the ground.

These tables are transformed into ONE symbolic data table where the classes are the towers. SDA can then be applied to this new table.

Page 43:

FROM COMPLEX DATA TO SYMBOLIC DATA

Page 44:

NETSYR results (SYR software)

Towers on the first PCA axes: PCA on chosen symbolic variables.

Visualisation of three clusters.

Interval and bar chart variables can be seen.

A network of the strongest links can be represented.

Page 45:

Symbolic variables projection inside the hypercube of the correlation sphere

Page 46:

Telephone calls: text mining in order to discover “themes” without using semantics

Each calling session is called a document. After lemmatisation, we start with a table of:

• 31454 documents

• 2258 words

INITIAL DATA: 2 814 446 rows giving the correspondence between documents and words:

Documents    Words
Doc1         bonjour
Doc1         oui
Doc1         monsieur
………
Doc2         panne
……

Page 47:

First steps: building overlapping clusters of documents and words: CLUSTSYR

2 814 446 rows: correspondence between documents and words (31454 documents x 2258 words).

70 x 2258: 70 overlapping clusters of documents described by the tf-idf of 2258 words.

2258 x 70: 2258 words described by their tf-idf on the 70 clusters of documents.

80 x 70: 80 overlapping clusters of words described by their tf-idf in the 70 clusters of documents.
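The description of clusters of documents by tf-idf can be sketched as follows. The slides do not give the exact weighting, so the formula used here (relative frequency of the word in the cluster, times the logarithm of the number of clusters divided by the number of clusters containing the word) is one common variant and an assumption. The rows and the cluster memberships are invented, and this is not the CLUSTSYR procedure itself.

```python
import math
from collections import Counter, defaultdict


def tf_idf(rows, cluster_of):
    """rows: (document, word) pairs; cluster_of: document -> list of clusters.

    A document may belong to several clusters (overlapping clusters).
    """
    counts = defaultdict(Counter)  # cluster -> word -> number of occurrences
    for doc, word in rows:
        for cluster in cluster_of[doc]:
            counts[cluster][word] += 1

    n_clusters = len(counts)
    # Number of clusters in which each word occurs at least once.
    containing = Counter(word for words in counts.values() for word in words)

    scores = {}
    for cluster, words in counts.items():
        total = sum(words.values())
        scores[cluster] = {
            w: (n / total) * math.log(n_clusters / containing[w])
            for w, n in words.items()
        }
    return scores


# Hypothetical (document, word) rows in the format of the initial data.
ROWS = [
    ("Doc1", "bonjour"), ("Doc1", "oui"), ("Doc1", "monsieur"),
    ("Doc2", "bonjour"), ("Doc2", "oui"),
    ("Doc3", "bonjour"), ("Doc3", "panne"), ("Doc3", "panne"),
]
# Hypothetical overlapping clusters: Doc2 belongs to both.
CLUSTER_OF = {"Doc1": ["greeting"], "Doc2": ["greeting", "failure"], "Doc3": ["failure"]}
SCORES = tf_idf(ROWS, CLUSTER_OF)
```

Words that occur in every cluster, such as "bonjour" here, get a zero weight, while words specific to one cluster stand out; transposing the resulting table gives the description of words by clusters of documents.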

Page 48:

Next step: STATSYR

Each cluster of documents is described by the 80 clusters of words called “themes”.

[Figure: table with the classes of documents in rows and the themes in columns, and the WORDS in each theme.]

Page 49:

GRAPHICAL REPRESENTATION by NETSYR from SYR software

GRAPHICAL REPRESENTATION of themes and document classes by pie charts, and their bar chart description.

Overlapping clusters

SOCIAL NETWORK based on dissimilarities

ANNOTATION of themes and document classes

Moving, zooming, …

We finally obtain a clear representation of the main themes, their classes and their links: “failures”, “budget”, “addresses”, “vacation”, etc.

Page 50:

A Survey on Security

• A sample of people from three regions (Vex, Val, Plai) have answered three questions:

• Gender: M or W,

• Security: priority to Fight Against Unemployment (FAU), Juvenile Delinquency (JD) or Drug addiction (D),

• Death penalty (Yes or No).

Gender, Security and D. Penalty are « barchart value variables ». M, W, FAU, JD, … are « bins ».

Page 51:

From barchart symbolic variables to Metabin latent variables

Table 1: Initial bar chart data table

Region    Gender         Insecurity            Death Penalty
          M      W       FAU    JD     D       Yes    No
Vex       0.8    0.2     0.4    0.5    0.1     0.5    0.5
Val       0.7    0.3     0.5    0.2    0.3     0.4    0.6
Plai      0.3    0.7     0.7    0.1    0.2     0.1    0.9

Table 2: Metabin latent variables

Region    S1cor                 S2cor                 S3cor
          M      JD     Yes     W      FAU    No      NU     D      NU
Vex       0.8    0.5    0.5     0.2    0.4    0.5     NU     0.1    NU
Val       0.7    0.2    0.4     0.3    0.5    0.6     NU     0.3    NU
Plai      0.3    0.1    0.1     0.7    0.7    0.9     NU     0.2    NU
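The regrouping from Table 1 to Table 2 can be replayed in a few lines. The sketch below takes the grouping of bins into metabins as given (it is read off Table 2, not computed), rebuilds the rows of Table 2, and checks that the bins placed in the same metabin are positively correlated across the three regions. That the grouping is driven by correlation is an assumption suggested by the suffix "cor"; the slides do not state the construction rule. `None` stands for the "NU" cells.

```python
from itertools import combinations

REGIONS = ["Vex", "Val", "Plai"]
# Table 1: frequency of each bin, in the order of REGIONS.
BINS = {
    "M": [0.8, 0.7, 0.3], "W": [0.2, 0.3, 0.7],
    "FAU": [0.4, 0.5, 0.7], "JD": [0.5, 0.2, 0.1], "D": [0.1, 0.3, 0.2],
    "Yes": [0.5, 0.4, 0.1], "No": [0.5, 0.6, 0.9],
}
# Table 2: one bin per variable in each metabin; None stands for "NU".
METABINS = {
    "S1cor": ["M", "JD", "Yes"],
    "S2cor": ["W", "FAU", "No"],
    "S3cor": [None, "D", None],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def metabin_table(bins, metabins, regions):
    """Rearrange the bin frequencies of each region by metabin."""
    return {
        region: {
            name: [bins[b][i] if b else None for b in members]
            for name, members in metabins.items()
        }
        for i, region in enumerate(regions)
    }


def within_correlations(bins, members):
    """Correlations between the bins that share a metabin."""
    used = [b for b in members if b]
    return {(a, b): pearson(bins[a], bins[b]) for a, b in combinations(used, 2)}


TABLE2 = metabin_table(BINS, METABINS, REGIONS)
CORRELATIONS = {
    name: within_correlations(BINS, members) for name, members in METABINS.items()
}
```

On these figures every pair of bins inside S1cor and inside S2cor has a positive correlation, which is consistent with reading each metabin as a group of bins that vary together across the regions.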

Page 52:

CONCLUSION

• If you have standard units described by numerical and (or) categorical variables, these variables induce “classes” described by symbolic variables that take their internal variation into account. SDA can then be applied to these new units in order to get complementary and enhanced results by extending standard analysis to symbolic analysis.

• Symbolic data have to be built from given standard or complex data.

• Symbolic data cannot be reduced to standard data.

• Complex data can be simplified into symbolic data.

• Big databases can be reduced to symbolic data.

• Symbolic data are not only distributions, they are the numbers of the future.

Page 53:

References

Basic books and papers:

• Bock H.H., Diday E. (editors and co-authors) (2000): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 425 pages, ISBN 3-540-66619-2.

• Billard L., Diday E. (2003) "From the statistics of data to the statistics of knowledge: Symbolic Data Analysis". JASA, Journal of the American Statistical Association. June, Vol. 98, No. 462.

• Diday E., Noirhomme M. (eds and co-authors) (2008) “Symbolic Data Analysis and the SODAS software”. 457 pages. Wiley. ISBN 978-0-470-01883-5.

• Billard L., Diday E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. 321 pages. Wiley series in computational statistics. Wiley, Chichester, ISBN 0-470-09016-2.

• Noirhomme-Fraiture M., Brito P. (2012) Far beyond the classical data models: symbolic data analysis. Statistical Analysis and Data Mining 4 (2), 157-170.

• Lazare N. (2013) "Symbolic Data Analysis". CHANCE magazine. Editor’s Letter, Vol. 26, No. 3.

Page 54:

Building Symbolic Data and Representation: References

• Stéphan V., Hébrail G.,Lechevallier Y. (2000) « Generation of symbolic objects from relationnal data base ». Chapter in book : Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds. H.-H.Bock and E. Diday). Springer-Verlag, Berlin, 103-124.

• Chiun-How, K., Chih-Wen, O., Yin-Jing, T., Chuan-kai, Yang, Chun-houh, Chen (2012) “A Symbolic Database for TIMSS”. Arroyo J., Maté C., Brito P., Noirhomme M. eds, 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de Madrid. http://www.sda-workshop.org/.

• E. Diday, F. Afonso, R. Haddad (2013) : “The symbolic data analysis paradigm, discriminate discretization and financial application”. In Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis, HDSDA 2013. Revue des Nouvelles Technologies de l'Information vol. RNTI-E-25, pp. 1-14

Page 55:

SOME SYMBOLIC DATA ANALYSIS REFERENCES

In Principal Component Analysis

Cazes P., Chouakria A., Diday E., Schektman Y. (1997). Extension de l’analyse en composantes principales à des données de type intervalle, Rev. Statistique Appliquées, Vol. XLV Num. 3, pp. 5-24, France.

Cazes P. (2002) Analyse factorielle d’un tableau de lois de probabilité. Revue de statistique appliquée, tome 50, n0 3.

Diday E. (2013) "Principal Component Analysis for bar charts and Metabins tables". Statistical Analysis and Data Mining. Article first published online: 20 May 2013. DOI: 10.1002/sam.11188. 2013 Wiley. Statistical Analysis and Data Mining,6,5, 403-430.

Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining, Wiley. 184-198.

Makosso-Kallyth S. and Diday E. (2012) Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification (ADAC). July, Volume 6, Issue 2, pp 147-159.

Rademacher, J., Billard , L., (2012) Principal component analysis for interval data. Wiley interdisciplinary Reviews: Computational Statistics .Volume 4, Issue 6, pp. 535–540.

Shimizu N., Nakano J. (2012) Histograms Principal Component Analysis. Arroyo J., Maté C., Brito P., Noirhomme M. eds, 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de Madrid. http://www.sda-workshop.org/

Wang H., Guan R., Wu J. (2012a). CIPCA: Complete-Information-based Principal Component Analysis for interval-valued data, Neurocomputing, Volume 86, Pages 158-169.

Page 56:

Symbolic Data Analysis references

In Symbolic Forecasting

Arroyo, J. and Maté, C. (2009). Forecasting histogram time series with k-nearest neighbors' methods. International Journal of Forecasting 25, 192–207.

García-Ascanio, C.; Maté, C. (2010). Electric power demand forecasting using interval time series: A comparison between VAR and iMLP. Energy Policy 38, 715-725

Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an application to the sterling-dollar exchange rate. Journal of Systems Science and Complexity, 21 (4), 550-565.

He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market Variability Forecasting. Computational Economics 33, 263-276.

In Symbolic rule extraction

Afonso, F. and Diday, E. (2005). Extension de l’algorithme Apriori et des regles d’association aux cas des donnees symboliques diagrammes et intervalles. Revue RNTI, Extraction et Gestion des Connaissances (EGC 2005), Vol. 1, pp 205-210, Cepadues, 2005.

In Symbolic Decision Tree

Ciampi, A., Diday, E., Lebbe, J., Perinel, E. and Vignes, R. (2000). Growing a tree classifier with imprecise data. Pattern Recognition Letters 21: 787-803.

Mballo C., Diday E. (2006) The criterion of Smirnov-Kolmogorov for binary decision tree: application to interval valued variables. Intelligent Data Analysis. Volume 10, Number 4, pp. 325–341.

Winsberg S., Diday E., Limam M. (2006). A tree structured classifier for symbolic class description. Compstat 2006. Physica-Verlag.

Bravo, M. and Garcia-Santesmases, J. (2000). Symbolic Object Description of Strata by Segmentation Trees, Computational Statistics, 15:13-24, Physica-Verlag.

In Clustering

• De Carvalho F., Souza R., Chavent M., and Lechevallier Y. (2006) Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognition Letters Volume 27, Issue 3, February 2006, Pages 167-179.

• De Souza R.M.C.R, De Carvalho F.A.T. (2004). Clustering of interval data based on City-Block distances. Pattern Recognition Letters, 25, 353–365.

• Diday E. (2008) Spatial classification. DAM (Discrete Applied Mathematics) Volume 156, Issue 8, Pages 1271-1294.

• Diday, E., Murty, N. (2005) "Symbolic Data Clustering" in Encyclopedia of Data Warehousing and Mining. John Wong editor. Idea Group Reference Publisher.

• Irpino, A. and Verde, R. (2008): Dynamic clustering of interval data using a Wasserstein-based distance. Pattern Recognition Letters 29, 1648-1658.

In Multidimensional Scaling

• Terada, Y., Yadohisa, H. (2011) Multidimensional scaling with hyperbox model for percentile dissimilarities. In: Watada, J., Phillips-Wren, G., Jain, L. C., and Howlett, R. J. (Eds.): Intelligent Decision Technologies, Springer Verlag, 779–788.

• Groenen, P.J.F., Winsberg, S., Rodriguez, O., Diday, E. (2006). I-Scal: Multidimensional scaling of interval dissimilarities. Computational Statistics and Data Analysis 51, 360–378.

In Self Organizing map

• Hajjar C., Hamdan H. (2011). Self-organizing map based on L2 distance for interval-valued data. In SACI 2011, 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (Timisoara, Romania), pp. 317–322.

In Dissimilarities between Symbolic Data

• Kim, J. and Billard, L. (2013): Dissimilarity measures for histogram-valued observations, Communications in Statistics - Theory and Methods, 42, 283-303.

• Verde, R., Irpino, A. (2010). Ordinary Least Squares for Histogram Data Based on Wasserstein Distance, in: Proc. COMPSTAT’2010, Y. Lechevallier and G. Saporta (Eds), pp. 581-589. Physica Verlag Heidelberg.

Some Symbolic Data Analysis references

In Regression and Canonical analysis extended to Symbolic Data

Dias, S., Brito, P., (2011). A New Linear Regression Model for Histogram-Valued Variables. In Proceedings of the 58th ISI World Statistics Congress (Dublin, Ireland).

Lauro, C., Verde, R., Irpino, A. (2008). Generalized canonical analysis, in: Symbolic Data Analysis and the Sodas Software, E. Diday and M. Noirhomme-Fraiture (Eds.), 313-330, Wiley, Chichester.

Tenenhaus A., Diday E., Emilion R., Afonso F. (2013) Regularized General Canonical Correlation Analysis Extended To Symbolic Data. ADAC (forthcoming).

Neto, E.A., De Carvalho F.A.T. (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis 54, 333-347.

Wang H., Guan R., Wu J. (2012c). Linear regression of interval-valued data based on complete information in hypercubes, Journal of Systems Science and Systems Engineering, Volume 21, Issue 4, pp. 422-442.

Some Symbolic Data Models references

• P. Bertrand, F. Goupil (2000) “Descriptive Statistics for symbolic data”. In H.H. Bock, E. Diday (Eds) “Analysis of Symbolic Data”. Springer-Verlag, pp. 106-124.

• Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics, 39 (1), 3-20.

• E. Diday, M. Vrac (2005) "Mixture decomposition of distributions by Copulas in the symbolic data analysis framework". Discrete Applied Mathematics (DAM). Volume 147, Issue 1, 1 April, pp. 27-41.

• E. Diday (2011) Modélisation de données symboliques et application au cas des intervalles. Journées Nationales de la Société Francophone de Classification. Orléans

• E. Diday (2002) “From Schweizer to Dempster: mixture decomposition of distributions by copulas in the symbolic data analysis framework” IPMU 2002, July, Annecy, France

• Diday E., Emilion R. (1997) "Treillis de Galois Maximaux et Capacités de Choquet". C.R. Acad. Sc. t.325, Série 1, pp. 261-266. Presented by G. Choquet in Mathematical Analysis.

• Diday E., R. Emilion (2003) Maximal and stochastic Galois lattices. Discrete Applied Math. Journal. Vol. 27 (2), pp. 271-284.

• Emilion R., Classification et mélanges de processus. C.R. Acad. Sci. Paris, 335, série I, 189-193 (2002).

• Emilion R., Unsupervised Classification and Analysis of objects described by nonparametric probability distributions. Statistical Analysis and Data Mining (SAM), Vol 5, 5, 388-398 (2012).

• J. Le-Rademacher, L. Billard (2011) “Likelihood functions and some maximum likelihood estimators for symbolic data”. Journal of Statistical Planning and Inference 141, 1593–1602. Elsevier.

• T. Soubdhan, R. Emilion, R. Calif (2009) “Classification of daily solar radiation distributions”. Solar Energy 83 (2009) 1056–1063. Elsevier.

Some SDA Industrial Applications

• Afonso F., Diday E., Badez N., Genest Y. (2010) Symbolic Data Analysis of Complex Data: Application to nuclear power plant. COMPSTAT’2010, Paris.

• Bezerra B., Carvalho F. (2011) Symbolic data analysis tools for recommendation systems. Knowl. Inf. Syst. 26: 385-418. DOI:10.1007/s10115-009-0282-3.

• Bouteiller V., Toque C., A., Cherrier J-F., Diday E., Cremona C. (2011) Non-destructive electrochemical characterizations of reinforced concrete corrosion: basic and symbolic data analysis. Corros Rev. Walter de Gruyter, Berlin, Boston. DOI 10.1515/corrrev-2011-002.

• Courtois, A., Genest, G., Afonso, F., Diday, E., Orcesi, A. (2012) In service inspection of reinforced concrete cooling towers: EDF’s feedback. IALCCE 2012, Vienna, Austria.

• Cury, A., Crémona, C., Diday, E. (2010). Application of symbolic data analysis for structural modification assessment. Engineering Structures Journal. Vol. 32, pp. 762-775.

• Christelle Fablet, Edwin Diday, Stephanie Bougeard, Carole Toque, Lynne Billard (2010). Classification of Hierarchical-Structured Data with Symbolic Analysis. Application to Veterinary Epidemiology. COMPSTAT’2010, Paris.

• Haddad R., Afonso F., Diday E. (2011) Approche symbolique pour l'extraction de thématiques: Application à un corpus issu d'appels téléphoniques. In Actes des XVIIIèmes Rencontres de la Société Francophone de Classification. Université d'Orléans.

• Laaksonen, S. (2008). People’s Life Values and Trust Components in Europe - Symbolic Data Analysis for 20-22 Countries. In Edwin Diday and Monique Noirhomme-Fraiture, "Symbolic Data Analysis and the SODAS Software", Chapter 22, pp. 405-419. Wiley and Sons: Chichester, UK.

• Quantin C., Billard L., Touati M., Andreu N., Cottin Y., Zeller M., Afonso F., Battaglia G., Seck D., Le Teuff G., and Diday E. (2011) Classification and Regression Trees on Aggregate Data Modeling: An Application in Acute Myocardial Infarction. Journal of Probability and Statistics, Volume 2011 (2011), 19 pages.

• Terraza V., Toque C. (2013) Mutual Fund Rating: A Symbolic Data Approach. In "Understanding Investment Funds: Insights from Performance and Risk Analysis", edited by Virginie Terraza and Hery Razafitombo. Economics & Finance Collection 2013. Palgrave Macmillan, UK.

• He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market Variability Forecasting. Computational Economics 33, 263-276.

• E. Diday, F. Afonso, R. Haddad (2013): The symbolic data analysis paradigm, discriminate discretization and financial application, in Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis, HDSDA 2013. Revue des Nouvelles Technologies de l'Information vol. RNTI-E-25, pp. 1-14.

• Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an application to the sterling-dollar exchange rate. Journal of Systems Science and Complexity, 21 (4), 550-565.