Graph Theoretic Latent Class Discovery and Its Robustness to Minimal Dominating Set Choice
J. L. Solka, C. E. Priebe, and D. J. Marchette
[email protected];[email protected]
NSWCDD
Interface04
Agenda
What is latent class discovery?
What are some approaches to the latent
class discovery process?
The class cover catch digraph classifier.
Latent class discovery results on a gene
expression data set.
Wrap-up and conclusions.
Acknowledgments
Michael C. Minnotte and Jurgen Symanzik,
and others for organizing the conference
Office of Naval Research through their ILIR
Program for funding this effort
What is Latent Class Discovery?
A latent class is a class of observations that reside
undiscovered within a known class of observations.
Develop a general methodology for the discernment of latent
class structure during discriminant analysis.
Moderately large hyperdimensional data sets.
During training or testing.
Explore applications of the developed methodologies to the
analysis of data sets in the areas of hyperdimensional image
analysis, artificial olfactory systems, computer security data,
gene expression data, and text data mining.
Flow Chart
HYPERDIMENSIONAL DATA
GRAPH THEORETIC DISCRIMINANT ANALYSIS
METRIC SPACE ADAPTATION
LATENT CLASSES
NONLINEAR DIMENSIONALITY REDUCTION
MULTIDIMENSIONAL SCALING
INSIGHTS
Dominating Set
[Figure panels: two-class data and covering discs; dominating set]
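The idea of a dominating set can be illustrated with a small sketch. It assumes the covering relation is already available as a mapping from each vertex to the set of vertices its disc covers (itself included). The function name `greedy_dominating_set` and the tie-breaking rule are illustrative assumptions, not taken from the slides.

```python
def greedy_dominating_set(covers):
    """Greedy approximation to a minimum dominating set.

    covers: dict mapping each vertex to the set of vertices it covers
    (assumed to include the vertex itself).
    """
    uncovered = set(covers)
    chosen = []
    while uncovered:
        # Take the vertex covering the most still-uncovered vertices.
        # Ties go to the smallest vertex label (an arbitrary choice).
        best = max(sorted(covers), key=lambda v: len(covers[v] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("some vertex is not covered by any vertex")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical toy covering relation on five vertices.
toy = {0: {0, 1, 2}, 1: {1}, 2: {2, 3}, 3: {3}, 4: {4}}
result = greedy_dominating_set(toy)
```

Changing the tie-breaking rule can change which dominating set is returned, which is one reason a greedy procedure may yield several different solutions on the same data.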
CCCD-Based Latent Class Discovery
[Figure: two-dimensional plot; horizontal axis from -7 to 4, vertical axis from -6 to 3]
ALL/AML Leukemia Gene Expression Analysis
72 Patients
7129 genes
Apply CCCD to ALL Observations
Cluster CCCD Solution Based on Radii
Examine Clusters for Latent Class Structure
Ascertain Significance of Latent Class Structure
[Figure legend: AML, ALL B-cell, ALL T-cell]
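The first step, applying the CCCD to one class, can be sketched as follows. The slides do not spell out the construction, so this assumes the usual class cover catch digraph convention: each target-class point gets an open ball whose radius is the distance to the nearest point of the other class, and a point covers every target-class point inside its ball. The function name and the toy coordinates are hypothetical.

```python
import math


def cccd_cover_sets(target, other):
    """Radii and covering relation for the target class.

    target, other: lists of coordinate tuples.
    Assumption: radius of x = distance from x to the nearest point in
    `other`; x covers y when dist(x, y) < radius of x (open ball).
    """
    radii = [min(math.dist(x, z) for z in other) for x in target]
    covers = {
        i: {j for j, y in enumerate(target) if math.dist(x, y) < radii[i]}
        for i, x in enumerate(target)
    }
    return radii, covers


# Hypothetical toy data: three target points and one non-target point.
radii, covers = cccd_cover_sets([(0, 0), (1, 0), (5, 0)], [(3, 0)])
```

The radii returned here are the quantities a later step could cluster on; the covering relation is what a dominating set is computed from.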
Resubstitution Error Rate Estimate
For each cluster map dimension $k$, an empirical risk (resubstitution error rate estimate) $\hat{L}_k$ is calculated.
[Equation not recoverable from the extracted slide text]
Classification Dimension
We proceed by defining the “scale dimension” $\hat{d}^{*}$ to be the cluster map dimension that
minimizes a dimensionality-penalized empirical risk;
$\hat{d}^{*} := \min\{\arg\min_{k} (\hat{L}_k + \delta \cdot k)\}$,
where $\hat{L}_k$ is the empirical risk at dimension $k$, for
some penalty coefficient $\delta \in [0, 1]$.
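A minimal sketch of choosing the scale dimension, assuming the penalized risk takes the additive form (empirical risk at dimension k) + (penalty coefficient) × k and that ties are resolved toward the smallest dimension. The function name `scale_dimension` and the risk values are hypothetical.

```python
def scale_dimension(risks, delta):
    """Smallest dimension k minimizing risks[k-1] + delta * k.

    risks: empirical risks for dimensions 1, 2, ..., len(risks).
    delta: penalty coefficient (assumed to be non-negative).
    """
    penalized = [r + delta * k for k, r in enumerate(risks, start=1)]
    best = min(penalized)
    # Several dimensions may attain the minimum; keep the smallest.
    return min(k for k, p in enumerate(penalized, start=1) if p == best)


# Hypothetical resubstitution error rates for dimensions 1 through 4.
example_risks = [0.30, 0.10, 0.05, 0.05]
no_penalty = scale_dimension(example_risks, 0.0)
with_penalty = scale_dimension(example_risks, 0.1)
```

With no penalty the lowest error wins outright; a positive penalty trades a little error for a lower dimension.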
ALL/AML Classification Dimension Plot
Gene Latent Class Discovery
ALL/AML MDS Plot
How Robust is the Methodology?
One other “success” story using artificial nose data.
What if we had used another dominating set in our analysis?
Is the discovered latent class structure independent of the dominating set used?
An Exhaustive Enumeration of All Possible Dominating Sets for the Gene Data
180 solutions, each with 21 nodes
16 of the nodes remain fixed across the solutions
14 greedy solutions
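A brute-force sketch of this kind of enumeration on a toy digraph: list every minimum-size dominating set, then intersect the solutions to find the vertices that stay fixed across them. This is not necessarily how the 180 solutions were obtained; the subset search below is exponential and only practical for very small graphs. The function names and the toy covering relation are hypothetical.

```python
from itertools import combinations


def minimum_dominating_sets(covers):
    """All dominating sets of the smallest possible size.

    covers: dict mapping each vertex to the set of vertices it covers
    (assumed to include the vertex itself).
    """
    vertices = sorted(covers)
    everything = set(vertices)
    for size in range(1, len(vertices) + 1):
        found = [
            set(combo)
            for combo in combinations(vertices, size)
            if set().union(*(covers[v] for v in combo)) == everything
        ]
        if found:
            return found
    return []


def fixed_vertices(solutions):
    """Vertices that appear in every solution."""
    return set.intersection(*solutions)


toy = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {3}}
solutions = minimum_dominating_sets(toy)
fixed = fixed_vertices(solutions)
```

In the toy example vertex 2 is forced into every solution while vertices 0 and 1 are interchangeable, a small-scale analogue of nodes that remain fixed versus nodes that change.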
Classification Space Curves for the 180 Solutions
[Figure: curves for the 180 solutions; horizontal axis from 5 to 20, vertical axis from 0.00 to 0.30]
Classification Dimension for the 180 Solutions (Red o: Greedy Solutions, Green *: Previous Solution)
[Figure: horizontal axis from 0 to 180, vertical axis from 2 to 7]
Number of Dominating Sets for Each Vertex
[Figure: number of dominating sets for each vertex; horizontal axis "Vertex" from 0 to 40, vertical axis "# Dominating Sets" from 0 to 150; legend: T-Cell, B-Cell, In-degree 0]
Digraph Analysis
[Figure 1: Unique induced subgraphs on the 5 changing vertices in the 180 dominating sets.]
[Figure 2: Unique induced subgraphs on the 5 changing vertices in the 14 dominating sets that could result from a greedy algorithm.]
Latent Class Discovery Figures of Merit
How can we be assured that all of the greedy dominating set solutions discover the same latent classes?
The previous greedy solution had 3 clusters that are pure B and 1 cluster that contained 8/9 of the T observations
Figures of merit: the percentage of B points that are in pure B clusters and the highest percentage of T points in any one cluster
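The two figures of merit can be computed from a cluster assignment and the B/T labels. This is a minimal sketch with hypothetical names (`purity_figures`, the label strings "B" and "T") and toy data; it returns fractions rather than percentages.

```python
from collections import Counter


def purity_figures(cluster_ids, labels):
    """Return (fraction of B points in pure-B clusters,
    largest fraction of T points found in any single cluster).

    cluster_ids: cluster assignment for each observation.
    labels: "B" or "T" for each observation (assumed label strings).
    """
    per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    counts = list(per_cluster.values())
    n_b = sum(c["B"] for c in counts)
    n_t = sum(c["T"] for c in counts)
    # A cluster is pure B when it holds B points and no T points.
    b_in_pure = sum(c["B"] for c in counts if c["B"] > 0 and c["T"] == 0)
    t_largest = max(c["T"] for c in counts)
    return b_in_pure / n_b, t_largest / n_t


# Toy assignment: clusters 0 and 2 are pure B, cluster 1 holds both T points.
b_fraction, t_fraction = purity_figures(
    [0, 0, 1, 1, 1, 2, 2], ["B", "B", "B", "T", "T", "B", "B"]
)
```

A solution that recovers the latent classes well scores high on both values: most B points sit in pure B clusters and most T points sit together in one cluster.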
Purity (Latent Class Discovery) for the Golub Gene Data (Red Triangles are the Greedy Solutions)
[Figure: scatter plot; horizontal axis "bpercent" from 0.4 to 0.9, vertical axis "tpercent" from 0.80 to 1.00]
Remaining Questions
Demonstrated similar latent class discovery among all of the greedy dominating set solutions
Many of the 7129 variates (genes) are superfluous to the discriminant analysis problem
Work is ongoing to examine the discovered latent classes based on subsets of the genes
Various figures of merit have been used to choose the subsets of the genes
Conclusions
Developed a new concept for latent class discovery during
discriminant analysis
Illustrated one graph theoretic methodology for the
discovery of the latent classes
Illustrated this methodology with a gene expression data set
Presented some preliminary results examining the
robustness of the discovery process to the CCCD process
Readings
C. E. Priebe, J. L. Solka, D. J. Marchette, and B. T. Clark, “Class Cover Catch Digraphs for Latent Class Discovery in Gene Expression Monitoring by DNA Microarrays,” to appear in the Special Issue of Computational Statistics and Data Analysis on Statistical Visualization, 2002+.
J. L. Solka, C. E. Priebe, and B. T. Clark, “A Visualization Framework for the Analysis of Hyperdimensional Data,” in International Journal of Image and Graphics Special Issue on Data Mining, 2002.
D. J. Marchette and C. E. Priebe, “Characterizing the scale dimension of a high-dimensional classification problem,” in Pattern Recognition, 2002.