Identifying Objects Using Cluster and Concept Analysis
Arie van Deursen, Tobias Kuipers
CWI, The Netherlands
Motivation
• Legacy code incomprehensible
– Lack of structure
• Case: >100,000 LOC Banking System
– Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system
General Plan
• Find interesting data
– Data selection
– Candidate attributes
• Find interesting functionality
– Program selection (procedure)
– Candidate methods
• Combine the two
– Candidate classes
Input Selection
• Domain related v. Implementation specific
• Persistent data stores
– Only records written to/read from file
– Refine by CRUD (Create/Read/Update/Delete)
– Records too big for one class
• Analysis of Program Call Graph
– high fan-out: control-programs
– high fan-in: low-level technical
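The call-graph filter can be sketched as follows. This is a minimal illustration, not the tooling used in the case study: the program names, the call graph, and both thresholds are invented for the example.

```python
from collections import defaultdict

# Hypothetical call graph: caller -> callees (all program names are invented).
CALLS = {
    "MAIN01": ["UPD-CUST", "UPD-MORT", "RPT-MORT", "DATE-CNV"],
    "UPD-CUST": ["DATE-CNV", "IO-ERR"],
    "UPD-MORT": ["DATE-CNV", "IO-ERR"],
    "RPT-MORT": ["DATE-CNV", "IO-ERR"],
}


def fan_metrics(calls):
    """Return {program: (fan_in, fan_out)} for every program in the graph."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for caller, callees in calls.items():
        unique = set(callees)          # count each callee once per caller
        fan_out[caller] += len(unique)
        for callee in unique:
            fan_in[callee] += 1
    programs = set(calls) | set(fan_in)
    return {p: (fan_in[p], fan_out[p]) for p in programs}


def select_programs(calls, max_fan_in=2, max_fan_out=3):
    """Drop high fan-out (control) and high fan-in (low-level) programs.

    The thresholds are assumptions chosen for this toy graph.
    """
    metrics = fan_metrics(calls)
    return sorted(p for p, (fi, fo) in metrics.items()
                  if fi <= max_fan_in and fo <= max_fan_out)


print(select_programs(CALLS))
```

On this toy graph MAIN01 is dropped for its fan-out of 4, DATE-CNV and IO-ERR for their fan-in of 4 and 3, leaving the three mid-level programs as candidates.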
Combining Data & Functionality
• Cluster analysis -- technique for finding groups in data
– Relies on metrics to compare distance between data items
• Concept analysis -- for finding groups too
– Relies on maximal subsets of data items sharing a set of features
Cluster Analysis
• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
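The distance step can be sketched on the table from this slide. The slides do not name the metric, so Euclidean distance over the 0/1 rows is an assumption here; `USAGE`, `distance` and `distance_matrix` are names introduced for the illustration.

```python
import math

# Rows of the slide's table: one 0/1 entry per column P1..P4.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}


def distance(a, b):
    """Euclidean distance between two rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(usage):
    """Distance between every ordered pair of fields."""
    names = list(usage)
    return {(p, q): distance(usage[p], usage[q]) for p in names for q in names}


D = distance_matrix(USAGE)
print(D[("NAME", "TITLE")])    # identical rows
print(D[("NUMBER", "CITY")])   # rows differ in one column
print(D[("NUMBER", "STREET")]) # rows differ in one column
```

Fields with identical rows are at distance 0. Under this metric STREET and CITY are each at distance 1 from NUMBER, so the two are equally close to the NUMBER group.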
Dendrogram
[Figure: dendrogram for the field table above, built up step by step; distance axis from 0 to 1. Groups joined at distance 0: {Name, Title, Initial, Prefix} and {Number, Nb-Ext, Zipcode}. City and Street are then added as separate leaves; annotation on the intermediate steps: "Distance is 1".]
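The build-up of the dendrogram can be sketched with a small agglomerative clustering loop. The slides do not state the linkage rule or the metric; single linkage and Euclidean distance are assumptions made for this illustration, and ties are broken by list order.

```python
import math
from itertools import combinations

# Rows of the field table: one 0/1 entry per column P1..P4.
USAGE = {
    "NAME": (1, 0, 0, 0), "TITLE": (1, 0, 0, 0),
    "INITIAL": (1, 0, 0, 0), "PREFIX": (1, 0, 0, 0),
    "NUMBER": (0, 0, 0, 1), "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE": (0, 0, 0, 1),
    "STREET": (0, 0, 1, 1), "CITY": (0, 1, 0, 1),
}


def distance(a, b):
    """Euclidean distance between two rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_link(c1, c2):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(distance(USAGE[p], USAGE[q]) for p in c1 for q in c2)


def agglomerate():
    """Repeatedly merge the closest pair of clusters; return the merge history."""
    clusters = [frozenset([name]) for name in USAGE]
    history = []
    while len(clusters) > 1:
        # min() returns the first closest pair, so ties follow list order.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(*pair))
        history.append((single_link(a, b), sorted(a), sorted(b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


for dist, left, right in agglomerate():
    print(f"{dist:.3f}  {left} + {right}")
```

In this sketch the name fields and the number fields merge at distance 0. STREET and CITY are both at distance 1 from the number group, so which of them joins first is decided only by the tie-breaking order.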
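Dendrogram from Real Data
[Figure: dendrogram of record fields from the real system; distance axis from 0 to 2. Leaf label groups, one group per line:]
AmountAccountOfficeName
BankCityIntAccountOfficeType
PaymentKindRelationNr
ChangeDate
TitleCdPrefixInitial
ZipCdCountyCd
StreetNr
MortSeqNrMortNr
CityStreet
Name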
Concept Analysis
• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x
Concept Lattice
[Figure: concept lattice for the table above, built up in three steps. Each node pairs a set of items (field names) with a set of features. Nodes:]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4
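The concepts in this lattice can be reproduced with a small brute-force sketch. It enumerates every subset of features, which is only practical for a toy table like this one and is not meant to represent how a concept analysis tool computes the lattice; `USES`, `extent`, `intent` and `concepts` are names introduced for the illustration.

```python
from itertools import combinations

# The x-marks of the table: which features (P1..P4) each field has.
USES = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}
FEATURES = ["P1", "P2", "P3", "P4"]


def extent(features):
    """Items that have every feature in `features`."""
    return frozenset(i for i, fs in USES.items() if set(features) <= fs)


def intent(items):
    """Features shared by all `items` (every feature if `items` is empty)."""
    shared = set(FEATURES)
    for item in items:
        shared &= USES[item]
    return frozenset(shared)


def concepts():
    """All (items, features) pairs that are maximal in both directions."""
    found = set()
    for k in range(len(FEATURES) + 1):
        for subset in combinations(FEATURES, k):
            items = extent(subset)
            found.add((items, intent(items)))
    return found


for items, features in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(features), "->", sorted(items))
```

For this table the sketch yields six concepts: top, the P1 and P4 groups, the single-field concepts for Street (P3 P4) and City (P2 P4), and bottom. Street and City appear both in the P4 concept and in a concept of their own, which is the "items in multiple groups" property of concept analysis.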
Real Concept Lattice
[Figure: concept lattice computed from the real system. Node labels: letters A to X and numbers 1 to 14.]
Concluding Remarks
• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
– Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis
Cluster v Concept Analysis
• Multiple partitionings
– Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
– Origin of cluster decision is lost
• Concept more efficient computationally
• Clustering needs more filtering
Questions
Current Approaches
• Subsystem classification techniques
– Survey, Lakhotia 97. Don’t work for Cobol, Cimitile 99
• Record as data part of a class
– Newcomb & Kotik (’95) take level 01 records, Fergen et al (’94) compare structure of records for reuse
• Manual Methodology
– Sneed (’92) provides manual methodology for migration of code, Sneed & Nyári (’95) derive ‘OO’ documentation from legacy.