Identifying Objects Using Cluster and Concept Analysis


Arie van Deursen, Tobias Kuipers

CWI, The Netherlands

Motivation

• Legacy code incomprehensible
  – Lack of structure
• Case: >100,000 LOC Banking System
  – Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system

General Plan

• Find interesting data
  – Data selection
  – Candidate attributes
• Find interesting functionality
  – Program selection (procedure)
  – Candidate methods
• Combine the two
  – Candidate classes

Input Selection

• Domain related v. Implementation specific
• Persistent data stores
  – Only records written to/read from file
  – Refine by CRUD (Create/Read/Update/Delete)
  – Records too big for one class
• Analysis of Program Call Graph
  – high fan-out: control-programs
  – high fan-in: low-level technical
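The call graph filter above can be sketched in a few lines of Python. This is only an illustration: the program names, the call graph, and the two thresholds are hypothetical and do not come from the case study.

```python
from collections import defaultdict

# Hypothetical call graph: caller -> set of called programs.
CALLS = {
    "MAIN01": {"CUST10", "MORT20", "DATE99"},
    "CUST10": {"DATE99", "FILE98"},
    "MORT20": {"DATE99", "FILE98"},
    "DATE99": set(),
    "FILE98": set(),
}

def fan_in_out(calls):
    """Return {program: (fan_in, fan_out)} for a caller -> callees mapping."""
    fan_in = defaultdict(int)
    for callees in calls.values():
        for callee in callees:
            fan_in[callee] += 1
    programs = set(calls) | set(fan_in)
    return {p: (fan_in[p], len(calls.get(p, ()))) for p in programs}

def select_candidates(calls, max_fan_in=1, max_fan_out=2):
    """Drop likely control programs (high fan-out) and utilities (high fan-in).

    The thresholds are illustrative assumptions, not values from the slides.
    """
    metrics = fan_in_out(calls)
    return sorted(p for p, (fi, fo) in metrics.items()
                  if fi <= max_fan_in and fo <= max_fan_out)

print(select_candidates(CALLS))
```

With this toy graph, MAIN01 is dropped for its fan-out and DATE99 and FILE98 for their fan-in, leaving CUST10 and MORT20 as candidate sources of methods.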

Combining Data & Functionality

• Cluster analysis: technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis: for finding groups too
  – Relies on maximal subsets of data items sharing a set of features

Cluster Analysis

• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
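A minimal sketch of the distance step, using the usage table above. The slides do not name the metric; Euclidean distance between the 0/1 usage rows is an assumption made here for illustration.

```python
import math

# Rows of the usage table: field -> (P1, P2, P3, P4).
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}

def distance(a, b):
    """Euclidean distance between the usage rows of fields a and b (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(USAGE[a], USAGE[b])))

print(distance("NAME", "TITLE"))     # same programs use both fields
print(distance("NUMBER", "STREET"))  # the rows differ only in P3
print(distance("STREET", "CITY"))    # the rows differ in P2 and P3
```

Fields used by exactly the same programs end up at distance 0, which is what lets the clustering step group them first.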

Dendrogram

[Figure: dendrogram for the table above, built up step by step along a distance axis from 0 to 1. Name, Title, Initial and Prefix form one cluster; Number, Nb-Ext and Zipcode form another. City and then Street are added below the Number/Nb-Ext/Zipcode cluster, with the annotation "Distance is 1".]
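The step-by-step construction of such a dendrogram can be sketched as agglomerative clustering. Both the Euclidean metric and the single-linkage rule used here are assumptions; the slides do not say which variants were used.

```python
import math
from itertools import combinations

# Field -> (P1, P2, P3, P4), as in the usage table.
USAGE = {
    "NAME": (1, 0, 0, 0), "TITLE": (1, 0, 0, 0),
    "INITIAL": (1, 0, 0, 0), "PREFIX": (1, 0, 0, 0),
    "NUMBER": (0, 0, 0, 1), "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE": (0, 0, 0, 1), "STREET": (0, 0, 1, 1),
    "CITY": (0, 1, 0, 1),
}

def dist(a, b):
    """Euclidean distance between two usage rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(USAGE[a], USAGE[b])))

def single_link(c1, c2):
    """Cluster distance = smallest distance between any two members (assumed linkage)."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerate(fields):
    """Repeatedly merge the two closest clusters; return [(distance, merged_cluster)]."""
    clusters = [frozenset([f]) for f in fields]
    merges = []
    while len(clusters) > 1:
        # Ties are broken by iteration order, i.e. arbitrarily.
        d, c1, c2 = min(
            ((single_link(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((d, c1 | c2))
    return merges

for d, cluster in agglomerate(list(USAGE)):
    print(round(d, 3), sorted(cluster))
```

Street and City are both at distance 1 from the Number/Nb-Ext/Zipcode cluster, so which one joins first depends only on the tie-breaking order. That arbitrariness is one reason a single dendrogram does not show all possible groupings.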

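Dendrogram from Real Data

[Figure: dendrogram of record fields from real data, distance axis from 0 to 2. Leaf label groups as they appear on the slide: AmountAccountOfficeName; BankCityIntAccountOfficeType; PaymentKindRelationNr; ChangeDate; TitleCdPrefixInitial; ZipCdCountyCd; StreetNr; MortSeqNrMortNr; CityStreet; Name.]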

Concept Analysis

• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x
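As a rough sketch, the concepts for this small table can be enumerated by brute force: take every set of features, collect the items that have all of them, then collect every feature those items share. The function names are illustrative, and this exhaustive approach is only practical for tiny tables.

```python
from itertools import combinations

# Item (field) -> set of features (programs that use it).
USES = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}
FEATURES = {"P1", "P2", "P3", "P4"}

def extent(features):
    """All items that have every feature in the given set."""
    return frozenset(i for i, fs in USES.items() if features <= fs)

def intent(items):
    """All features shared by every item in the given set."""
    shared = set(FEATURES)
    for item in items:
        shared &= USES[item]
    return frozenset(shared)

def concepts():
    """Return the set of (items, features) pairs that are maximal in both directions."""
    found = set()
    for r in range(len(FEATURES) + 1):
        for subset in combinations(sorted(FEATURES), r):
            ext = extent(set(subset))
            found.add((ext, intent(ext)))
    return found

for items, features in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(features), sorted(items))
```

For the table above this yields six concepts. Street and City each appear both in their own concept and in the larger P4 concept, which a single clustering cannot express.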

Concept Lattice

[Figure: concept lattice for the table above, built up step by step. Each node pairs a set of features with a set of items (field names).]

• top: All Variables (no shared features)
• P1: Name, Title, Initial, Prefix
• P4: Number, Nb-Ext, Zipcode, Street, City
• P3 P4: Street
• P2 P4: City
• bottom: P1 P2 P3 P4 (no items)
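The edges of this lattice follow from set inclusion between the item sets. The sketch below hard-codes the six concepts of the example, labelled by their features as on the slide, and derives which concept sits directly below which; the function name is illustrative.

```python
# The six concepts of the example, labelled by their feature sets; values are item sets.
CONCEPTS = {
    "top": frozenset({"NAME", "TITLE", "INITIAL", "PREFIX", "NUMBER",
                      "NUMBER-EXT", "ZIPCODE", "STREET", "CITY"}),
    "P1": frozenset({"NAME", "TITLE", "INITIAL", "PREFIX"}),
    "P4": frozenset({"NUMBER", "NUMBER-EXT", "ZIPCODE", "STREET", "CITY"}),
    "P3 P4": frozenset({"STREET"}),
    "P2 P4": frozenset({"CITY"}),
    "bottom": frozenset(),
}

def hasse_edges(concepts):
    """Return (lower, upper) pairs where lower is directly below upper.

    A concept is below another when its item set is a proper subset;
    'directly' means no third concept lies strictly in between.
    """
    edges = []
    for lo_name, lo in concepts.items():
        for hi_name, hi in concepts.items():
            if lo < hi and not any(lo < mid < hi for mid in concepts.values()):
                edges.append((lo_name, hi_name))
    return sorted(edges)

for lower, upper in hasse_edges(CONCEPTS):
    print(lower, "->", upper)
```

The Street and City concepts hang directly below the P4 concept rather than below top, matching the final lattice on the slide.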

Real Concept Lattice

[Figure: concept lattice computed from real data; nodes carry the labels A to X and 1 to 14.]

Concluding Remarks

• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
  – Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis

Cluster v Concept Analysis

• Multiple partitionings
  – Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
  – Origin of cluster decision is lost
• Concept more efficient computationally
• Clustering needs more filtering

Questions

Current Approaches

• Subsystem classification techniques
  – Survey, Lakhotia 97. Don’t work for Cobol, Cimitile 99
• Record as data part of a class
  – Newcomb & Kotik (‘95) take level 01 records, Fergen et al (94) compare structure of records for reuse
• Manual Methodology
  – Sneed (‘92) provides manual methodology for migration of code, Sneed & Nyári (‘95) derive ‘OO’ documentation from legacy.
