Identifying Objects Using Cluster and Concept Analysis
Arie van Deursen, Tobias Kuipers
CWI, The Netherlands
Feb 25, 2016
Motivation
• Legacy code incomprehensible
– Lack of structure
• Case: >100,000 LOC Banking System
– Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system
General Plan
• Find interesting data
– Data selection
– Candidate attributes
• Find interesting functionality
– Program selection (procedure)
– Candidate methods
• Combine the two
– Candidate classes
Input Selection
• Domain related v. Implementation specific
• Persistent data stores
– Only records written to/read from file
– Refine by CRUD (Create/Read/Update/Delete)
– Records too big for one class
• Analysis of Program Call Graph
– high fan-out: control-programs
– high fan-in: low-level technical
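The call-graph filter above can be sketched in Python. This is only an illustration: the program names, the call graph, and both thresholds are hypothetical and do not come from the banking system. Programs with high fan-out (control programs) or high fan-in (low-level technical programs) are dropped, and the rest are kept as candidates for methods.

```python
from collections import defaultdict

# Hypothetical call graph: caller -> programs it calls (names are made up).
CALLS = {
    "MAIN01": ["CUST01", "ACCT01", "MORT01"],
    "CUST01": ["DATEUTIL", "IOERR"],
    "ACCT01": ["DATEUTIL", "IOERR"],
    "MORT01": ["DATEUTIL", "IOERR"],
}


def fan_metrics(calls):
    """Return {program: (fan_in, fan_out)} for every program in the graph."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for caller, callees in calls.items():
        for callee in set(callees):  # count each distinct edge once
            fan_out[caller] += 1
            fan_in[callee] += 1
    programs = set(fan_in) | set(fan_out)
    return {p: (fan_in[p], fan_out[p]) for p in programs}


def select_candidates(calls, max_fan_in=2, max_fan_out=2):
    """Keep programs that are neither control programs nor technical utilities.

    Both thresholds are assumptions chosen for this toy graph.
    """
    metrics = fan_metrics(calls)
    return sorted(
        p for p, (fi, fo) in metrics.items()
        if fi <= max_fan_in and fo <= max_fan_out
    )


if __name__ == "__main__":
    print(select_candidates(CALLS))
```

In this toy graph MAIN01 is rejected for its fan-out and DATEUTIL and IOERR for their fan-in, which leaves the three business programs.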
Combining Data & Functionality
• Cluster analysis -- technique for finding groups in data
– Relies on metrics to compare distance between data items
• Concept analysis -- for finding groups too
– Relies on maximal subsets of data items sharing a set of features
Cluster Analysis
• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
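As a rough sketch of the distance step, the rows of the usage table can be compared pairwise. The slides do not name the metric, so Euclidean distance is an assumption here. It does agree with the "Distance is 1" annotation for City against the Number fields.

```python
import math

# Field usage per program (P1..P4), copied from the example table.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}


def distance(a, b):
    """Euclidean distance between two usage rows (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(usage):
    """Map every unordered pair of fields to its distance."""
    names = list(usage)
    return {
        (f, g): distance(usage[f], usage[g])
        for i, f in enumerate(names)
        for g in names[i + 1:]
    }


if __name__ == "__main__":
    d = distance_matrix(USAGE)
    for pair in [("NAME", "TITLE"), ("NUMBER", "CITY"), ("STREET", "CITY")]:
        print(pair, d[pair])
```

Fields used by exactly the same programs end up at distance 0, which is why they are the first to be grouped.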
Dendrogram
[Figure: dendrogram for the table above, built up step by step on a distance axis from 0 to 1]
• Name, Title, Initial, Prefix
• Number, Nb-Ext, Zipcode
• City, then Street, added at distance 1
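The step-by-step construction of the dendrogram can be approximated with a small agglomerative clustering loop. This is a minimal sketch, not the tool used for the case study. Euclidean distance and single linkage are both assumptions, and `cluster` and `linkage` are names invented for this example.

```python
import math
from itertools import combinations

# Field usage per program (P1..P4), copied from the example table.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}


def distance(a, b):
    """Euclidean distance between two usage rows (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linkage(c1, c2, usage):
    """Single linkage: distance between the closest members of two clusters."""
    return min(distance(usage[f], usage[g]) for f in c1 for g in c2)


def cluster(usage, max_distance):
    """Merge the two closest clusters until none are within max_distance.

    Returns the remaining clusters and the merges as (height, cluster)
    pairs in the order in which they were made.
    """
    clusters = [frozenset([f]) for f in usage]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], usage),
        )
        height = linkage(c1, c2, usage)
        if height > max_distance:
            break
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 | c2)
        merges.append((height, c1 | c2))
    return clusters, merges


if __name__ == "__main__":
    for limit in (0, 1):
        groups, _ = cluster(USAGE, limit)
        print(limit, sorted(sorted(g) for g in groups))
```

Cutting at distance 0 keeps Street and City apart from the Number fields. Cutting at distance 1 absorbs both into that group, so the choice of cut decides which candidate classes come out.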
Dendrogram from Real Data
[Figure: dendrogram of record fields on a distance axis from 0 to 2. Leaf labels:]
Amount, Account, OfficeName
BankCity, IntAccount, OfficeType
PaymentKind, RelationNr
ChangeDate
TitleCd, Prefix, Initial
ZipCd, CountyCd
StreetNr
MortSeqNr, MortNr
City, Street
Name
Concept Analysis
• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x
Concept Lattice
[Figure: concept lattice for the example table, built up step by step. Each node pairs a set of features with a set of items (field names).]
• top: All Variables
• P1: Name, Title, Initial, Prefix
• P4: Number, Nb-Ext, Zipcode, Street, City
• P3 P4: Street
• P2 P4: City
• bottom: P1 P2 P3 P4
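The nodes of the example lattice can be reproduced with a naive enumeration. For every subset of features, take the items that share all of them, then take the features common to exactly those items. This brute-force sketch is exponential in the number of features and is meant only as an illustration, not as the algorithm behind the case study. The function names are invented for this example.

```python
from itertools import combinations

FEATURES = ["P1", "P2", "P3", "P4"]

# Item -> features (programs using the field), from the example table.
CONTEXT = {
    "NAME": {"P1"},
    "TITLE": {"P1"},
    "INITIAL": {"P1"},
    "PREFIX": {"P1"},
    "NUMBER": {"P4"},
    "NUMBER-EXT": {"P4"},
    "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"},
    "CITY": {"P2", "P4"},
}


def items_sharing(features):
    """All items that have every feature in the given set."""
    return frozenset(i for i, fs in CONTEXT.items() if features <= fs)


def common_features(items):
    """All features shared by every item (all features if items is empty)."""
    shared = set(FEATURES)
    for item in items:
        shared &= CONTEXT[item]
    return frozenset(shared)


def concepts():
    """Enumerate all (items, features) concepts by closing feature subsets."""
    found = set()
    for size in range(len(FEATURES) + 1):
        for subset in combinations(FEATURES, size):
            items = items_sharing(set(subset))
            found.add((items, common_features(items)))
    return found


if __name__ == "__main__":
    for items, features in sorted(concepts(), key=lambda c: -len(c[0])):
        print(sorted(features), sorted(items))
```

Unlike the dendrogram, the result keeps Street and City in their own concepts (with P3 P4 and P2 P4) while also listing them under P4, so an item can belong to more than one group.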
Real Concept Lattice
[Figure: concept lattice from real data, with nodes labeled A to X and 1 to 14]
Concluding Remarks
• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
– Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis
Cluster v Concept Analysis
• Multiple partitionings
– Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
– Origin of cluster decision is lost
• Concept analysis more efficient computationally
• Clustering needs more filtering
Questions
Current Approaches
• Subsystem classification techniques
– Survey, Lakhotia '97. Don't work for Cobol, Cimitile '99
• Record as data part of a class
– Newcomb & Kotik ('95) take level 01 records, Fergen et al. ('94) compare structure of records for reuse
• Manual Methodology
– Sneed ('92) provides manual methodology for migration of code, Sneed & Nyári ('95) derive 'OO' documentation from legacy.