Page 1: Identifying Objects  Using Cluster and Concept Analysis

Identifying Objects Using Cluster and Concept Analysis

Arie van Deursen, Tobias Kuipers

CWI, The Netherlands

Page 2: Identifying Objects  Using Cluster and Concept Analysis

Motivation

• Legacy code incomprehensible
  – Lack of structure
• Case: >100,000 LOC Banking System
  – Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system

Page 3: Identifying Objects  Using Cluster and Concept Analysis

General Plan

• Find interesting data
  – Data selection
  – Candidate attributes
• Find interesting functionality
  – Program selection (procedure)
  – Candidate methods
• Combine the two
  – Candidate classes

Page 4: Identifying Objects  Using Cluster and Concept Analysis

Input Selection

• Domain related v. Implementation specific
• Persistent data stores
  – Only records written to/read from file
  – Refine by CRUD (Create/Read/Update/Delete)
  – Records too big for one class
• Analysis of Program Call Graph
  – high fan-out: control-programs
  – high fan-in: low-level technical
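
The call-graph criterion can be sketched as follows. This is only a minimal illustration, assuming the call graph is available as a mapping from each program to the programs it calls. The program names, the `select_programs` helper, and the thresholds are hypothetical and not taken from the case study.

```python
from collections import defaultdict


def fan_in_out(call_graph):
    """Return {program: (fan_in, fan_out)} for a caller -> callees mapping."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for caller, callees in call_graph.items():
        unique = set(callees)          # count each distinct callee once
        fan_out[caller] += len(unique)
        for callee in unique:
            fan_in[callee] += 1
    programs = set(call_graph) | set(fan_in)
    return {p: (fan_in[p], fan_out[p]) for p in programs}


def select_programs(call_graph, max_fan_in, max_fan_out):
    """Drop likely control programs (high fan-out) and technical utilities (high fan-in)."""
    stats = fan_in_out(call_graph)
    return sorted(p for p, (fi, fo) in stats.items()
                  if fi <= max_fan_in and fo <= max_fan_out)


# Hypothetical call graph: one control program, four domain programs, one I/O utility.
CALL_GRAPH = {
    "MAIN": ["P1", "P2", "P3", "P4"],
    "P1": ["IOUTIL"],
    "P2": ["IOUTIL"],
    "P3": ["IOUTIL"],
    "P4": ["IOUTIL"],
}

print(select_programs(CALL_GRAPH, max_fan_in=2, max_fan_out=2))
```

With these assumed thresholds, `MAIN` is filtered out for its fan-out and `IOUTIL` for its fan-in, leaving the four programs in between as candidates.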

Page 5: Identifying Objects  Using Cluster and Concept Analysis

Combining Data & Functionality

• Cluster analysis -- technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis -- for finding groups too
  – Relies on maximal subsets of data items sharing a set of features

Page 6: Identifying Objects  Using Cluster and Concept Analysis

Cluster Analysis

• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
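
To make the distance step concrete, here is a small sketch over the field usage table. It assumes Euclidean distance between the 0/1 usage rows. The slides do not name the metric, so that choice, like the names `USAGE` and `distance`, is an assumption for illustration.

```python
import math

# Usage table from the slide: field -> (P1, P2, P3, P4), 1 if the program uses the field.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}


def distance(a, b):
    """Euclidean distance between the usage rows of fields a and b (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(USAGE[a], USAGE[b])))


print(distance("NAME", "TITLE"))      # identical rows
print(distance("CITY", "NUMBER"))     # rows differ in one program
print(distance("STREET", "ZIPCODE"))  # rows differ in one program
```

Under this assumption, fields used by exactly the same programs are at distance 0, and fields whose usage differs in a single program are at distance 1.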

Page 7: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix

Page 8: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode

Page 9: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode
Distance is 1

Page 10: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode
City
Distance is 1

Page 11: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode
City
Street

Page 12: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode
City
Street

Page 13: Identifying Objects  Using Cluster and Concept Analysis

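Dendrogram

[Figure: dendrogram over the field usage table, distance axis from 0 to 1]
Cluster: Name, Title, Initial, Prefix
Cluster: Number, Nb-Ext, Zipcode
City
Street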
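
The dendrogram build-up can be imitated with a small agglomerative clustering sketch. It assumes Euclidean distance and single linkage, where the distance between clusters is the smallest distance between any two members. The slides do not state which metric or linkage was used, so both are assumptions, and `agglomerate` is a hypothetical helper, not the tool used in the study.

```python
import math
from itertools import combinations

# Usage table from the slides: field -> (P1, P2, P3, P4).
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def single_linkage(c1, c2):
    """Cluster distance = smallest distance between any two members (assumed linkage)."""
    return min(euclidean(USAGE[a], USAGE[b]) for a in c1 for b in c2)


def agglomerate(items):
    """Repeatedly merge the two closest clusters; return [(distance, merged cluster), ...]."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_linkage(*pair))
        d = single_linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((d, c1 | c2))
    return merges


for d, cluster in agglomerate(list(USAGE)):
    print(f"{d:.2f}", sorted(cluster))
```

Each printed line corresponds to one join in a dendrogram. Ties, such as City and Street both lying at distance 1 from the Number cluster, are broken here by iteration order, so a different order can yield a different hierarchy.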

Page 14: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram from Real Data

[Figure: dendrogram, distance axis from 0 to 2; leaf labels follow as extracted]

AmountAccountOfficeName

BankCityIntAccountOfficeType

PaymentKindRelationNr

ChangeDate

TitleCdPrefixInitial

ZipCdCountyCd

StreetNr

MortSeqNrMortNr

CityStreet

Name

Page 15: Identifying Objects  Using Cluster and Concept Analysis

Concept Analysis

• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x

Page 16: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: concept lattice; each node pairs a set of items (field names) with a set of features]
top: All Variables
bottom: P1 P2 P3 P4

Page 17: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: concept lattice]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
bottom: P1 P2 P3 P4

Page 18: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: concept lattice]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4

Page 19: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: concept lattice]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4
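
The nodes of this small lattice can be recomputed with a brute-force sketch: for every subset of programs, take the fields that use all of them, then the programs shared by those fields. This is only an illustration under stated assumptions. The names `CONTEXT`, `extent`, `intent` and `concepts` are hypothetical, and enumerating all feature subsets is exponential, so it is not how a real concept analysis tool would work.

```python
from itertools import combinations

# Same data as the slide table: field -> set of programs (features) that use it.
CONTEXT = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"},
    "CITY": {"P2", "P4"},
}
FEATURES = {"P1", "P2", "P3", "P4"}


def extent(features):
    """All items that have every feature in `features`."""
    return frozenset(i for i, fs in CONTEXT.items() if features <= fs)


def intent(items):
    """All features shared by every item in `items` (all features if `items` is empty)."""
    shared = set(FEATURES)
    for i in items:
        shared &= CONTEXT[i]
    return frozenset(shared)


def concepts():
    """Enumerate (items, features) pairs that are maximal with respect to each other."""
    found = set()
    for r in range(len(FEATURES) + 1):
        for subset in combinations(sorted(FEATURES), r):
            items = extent(set(subset))
            found.add((items, intent(items)))
    return found


for items, feats in sorted(concepts(), key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(feats), sorted(items))
```

For this table the sketch yields six concepts: top, the P1 node, the P4 node, the Street and City nodes, and bottom. Street and City appear both in their own node and in the P4 node, which shows how an item can sit in more than one group.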

Page 20: Identifying Objects  Using Cluster and Concept Analysis

Real Concept Lattice

[Figure: concept lattice from real data; node labels A through X and 1 through 14]

Page 21: Identifying Objects  Using Cluster and Concept Analysis

Concluding Remarks

• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
  – Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis

Page 22: Identifying Objects  Using Cluster and Concept Analysis

Cluster v Concept Analysis

• Multiple partitionings
  – Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
  – Origin of cluster decision is lost
• Concept more efficient computationally
• Clustering needs more filtering

Page 23: Identifying Objects  Using Cluster and Concept Analysis

Questions

Page 24: Identifying Objects  Using Cluster and Concept Analysis

Current Approaches

• Subsystem classification techniques
  – Survey, Lakhotia '97. Don’t work for Cobol, Cimitile '99
• Record as data part of a class
  – Newcomb & Kotik ('95) take level 01 records, Fergen et al ('94) compare structure of records for reuse
• Manual Methodology
  – Sneed ('92) provides manual methodology for migration of code, Sneed & Nyári ('95) derive 'OO' documentation from legacy.

