Dynamic Classification of Defect Structures in Molecular Dynamics Simulation Data

S. Mehta∗, S. Barr†, T. Choy†, H. Yang∗, S. Parthasarathy∗, R. Machiraju∗, J. Wilkins†

∗Department of Computer Science and Engineering, Ohio State University
†Department of Physics, Ohio State University
Contact: {mehtas,srini,raghu}@cse.ohio-state.edu

January 2, 2005
Abstract

In this application paper we explore techniques to classify anomalous structures (defects) in data generated from ab-initio Molecular Dynamics (MD) simulations of Silicon (Si) atom systems. These systems are studied to understand the processes behind the formation of various defects, as defects have a profound impact on the electrical and mechanical properties of Silicon. In our prior work we presented techniques for defect detection [11, 12, 14]. Here, we present a two-step dynamic classifier to classify the defects. The first step uses up to third-order shape moments to provide a smaller set of candidate defect classes. The second step assigns the correct class to the defect structure by considering the actual spatial positions of the individual atoms. The dynamic classifier is robust and scalable in the size of the atom systems. Each phase is immune to noise, which is characterized after a study of the simulation data. We also validate the proposed solutions by using a physical model and properties of lattices. We demonstrate the efficacy and correctness of our approach on several large datasets. Our approach is able to recognize previously seen defects and also identify new defects in real time.
1 Introduction
Traditionally, the focus in the computational sciences has been on developing algorithms, implementations, and enabling tools to facilitate large-scale realistic simulations of physical processes and phenomena. However, as simulations become more detailed and realistic, and implementations more efficient, scientists are finding that analyzing the data produced by these simulations is a non-trivial task. Dataset size, providing reasonable response time, and modeling the underlying scientific phenomenon during the analysis are some of the critical challenges that need to be addressed.
In this paper we present a framework that addresses these challenges for mining datasets produced by Molecular Dynamics (MD) simulations to study the evolution of defect structures in materials. As component size decreases, a defect - any deviation from the perfectly ordered crystal lattice - in a semiconductor assumes ever greater significance. These defects are often created by introducing extra atom(s) in the Silicon lattice during ion implantation for device fabrication. Such defects can aggregate to form larger extended defects, which can significantly affect device performance in an undesirable fashion.
Simulating defect dynamics can potentially help scientists understand how defects evolve over time and how aggregated/extended defects are formed. Some of these defects are stable over a period of time while others are short-lived. Efficient, automated or semi-automated analysis techniques can help simplify the task of wading through a pool of data and help quickly identify important rules governing defect evolution, interactions and aggregation. The key challenges are: i) to detect defects; ii) to characterize and classify both new and previously seen defects accurately; iii) to capture the evolution and transitioning behavior of defects; and iv) to identify the rules that govern defect interactions and aggregation. Manual analysis of these simulations is a very cumbersome process. It takes a domain expert more than eight hours to manually analyze a very small simulation of 8000 time frames. Therefore, a systemic challenge is to develop an automated, scalable and incremental algorithmic framework so that the proposed techniques can support in-vivo analysis in real time.
In our previous work [11, 12, 14], we presented algorithms to address the first challenge. Here we address the second challenge coupled with the systemic challenge outlined above. The design tenets include not only accuracy and execution time but also both statistical and physical validation of the proposed models. We also present preliminary results
to show that our approach can aid in handling the third challenge. The main contributions of our application case study paper are:
1. We develop a two-step incremental classifier that classifies both existing and new defects (generates a new class label).

2. We validate each step of our 2-step classifier theoretically, relying on both physical and statistical models.

3. We validate our approach on large (greater than 4GB) real MD simulation datasets and demonstrate both the exceptional accuracy and efficiency of the proposed framework.

4. We present initial results which show that our approach can be used to capture defect evolution and to generate labeled defect trajectories.
Our paper is structured in the following manner. Section 2 discusses the basic terminology of MD and related work. An overview of our framework is provided in Section 3. We present our algorithm in Section 4. Results on large simulation datasets are presented in Section 5. Finally, we conclude and discuss directions for future work in Section 6.
2 Background and Related Work
2.1 Background: In this section, we first define basic terms that are used throughout this article. Later, we describe pertinent related work. A lattice is an arrangement of points or particles or objects in a regular periodic pattern in three dimensions. The elemental structure that is replicated in a lattice is known as a unit cell. Now, consider adding a single atom to the lattice. This extra atom disturbs the geometric structure of the lattice. This disturbance, comprised of atoms which deviate from the regular geometry of the lattice, is referred to as the defect structure. Defects created by adding an extra atom are known as single-interstitial defects. Similarly, one can define di- and tri-interstitial defects by adding two or three single-interstitial defects in a lattice, respectively. Figure 1(a) shows a Si bulk lattice with a certain unit cell shaded differently (black). Figure 1(b) shows another lattice with a single interstitial defect. Figure 1(c) depicts two interstitials: in the lower left and upper right corners respectively of a 512-atom lattice. The different shades again represent separate and distinct defects.
We use the Object-Oriented High Performance Multiscale Materials Simulator (OHMMS) that some of us developed (primarily led by co-author Wilkins) [4] as our workhorse. The equation of motion as described by Newton's second law is used to determine atom locations. While the exact forces must be derived from quantum mechanical studies and computations, these classical equations serve as a suitable approximation.
2.2 Related Work: Traditionally, physicists have used ground energy and electrostatic potential to find defects in a lattice. For example, ab-initio methods are used to locate interstitial defects in a Si lattice [2, 19]. These methods exploit anomalies in the energy/potential fields available at all points in the lattice. However, the calculation and analysis of these energies/potentials is very time consuming. The most pertinent work that is closely related to our own employs the method of Common Neighbor Analysis (CNA) [3, 8]. CNA attempts to glean crystallization structure in lattices and uses the number of neighbors of each atom to determine the structure sought. However, it should be noted that the distribution of neighbors alone cannot characterize the defects, especially at high temperatures. Related to our work is the large body of work in biomolecular structure analysis [1, 10, 16, 20]. In these techniques, the data is often abstracted as graphs and transactions and subsequently mined. However, such an abstraction often does not exploit and explain many of the inherent spatial and dynamical properties of the data that we are interested in. Moreover, while some of these techniques [17, 20] deal with noise in the data, their noise handling capabilities are limited to smoothing out uncorrelated and small changes in the spatial locations of atoms. Within the context of MD data, noise can also change the number of defect atoms detected for essentially the same defect structure at different time frames. None of the methods in the biological data mining literature deal with this uncertainty. Matching of two structures has drawn a lot of attention in the recent past. Zhang et al. [22] propose a protein matching algorithm that is rotation and translation invariant. This method relies on the shape of the point cloud, and it works well for proteins given the relatively large number of atoms: the presence of a few extra atoms does not change the shape of the point cloud, so a potential match is not stymied by them. However, in MD simulation data, we are interested in anomalous structures which can be as small as just six atoms. Extra atoms that may be included in a defect given the thermal noise will skew a match significantly even if the two defects differ by one atom. Geometric hashing, an approach that was originally developed in the robotics and vision communities [21], has found favor in the biomolecular structure analysis community [15, 20]. Rotations and translations are well handled in this approach. The main drawback of geometric hashing is that it is very expensive because an exhaustive search for the optimal basis vector is required. A more detailed discussion of various shape matching algorithms can be found in the survey paper by Veltkamp and Hagedoorn [18]. We use statistical moments to represent the shape of a defect for initial pruning. The seminal work by Hu [13] described a set of seven moments which
Figure 1: (a) Original lattice with unit cell marked (b) Lattice with one interstitial defect (c) Lattice with two interstitial defects
capture many of the features of two-dimensional binary images. In recent work, Galvez [7] proposed an algorithm which uses shape moments to characterize 3D structures via the moments of their two-dimensional contours. However, owing to the relatively small number of atoms present in a typical defect, the contours or the corresponding implicit 3D surface are impossible to obtain with accuracy.
3 Dynamic Classification Framework
Figure 2 shows our framework for MD simulation data analysis. The framework is divided into 3 phases.
Phase 1 - Defect Detection: this phase detects and spatially clusters (or segments) defects, and handles periodic boundary conditions. A detailed explanation of this phase is given in our earlier work [11, 12, 14].
Phase 2 - Dynamic Classification: this phase classifies each defect found in Phase 1. This phase consists of three major components:

1. Generating a feature vector for each detected defect. This feature vector is composed of weighted statistical moments.

2. Pruning the defect classes based on the feature vector. This step provides a smaller subset of defect classes to which an unlabeled defect can potentially belong.

3. An exact matching algorithm that assigns the correct class label to the defect. This step takes into account the spatial positions of the atoms. The defect is assigned a class label if it matches any of the previously seen defects; otherwise the defect is considered new.

Phase 2 maintains 3 databases for all detected defects. Section 4 gives detailed information regarding these three databases and their update strategies. The framework is made robust by modeling noise in both Phase 1 and Phase 2. Our noise characterization models the aggregate movement of the defect structure and the arrangement of the atoms (in terms of neighboring bonds). Our framework can be deployed to operate in a streaming fashion. This is important since it enables us to naturally handle data in a continuous fashion while a simulation is in progress. Phase 1 handles the entire frame and detects all the defects. Each defect is then pipelined into Phase 2. Thus, we are able to incrementally detect and classify defects while consistently updating the database in real time.
Phase 3 - Knowledge Mining: this phase uses the databases generated by Phase 2. These databases store the information about all the defects in a given simulation. They can be used to track and generate the trajectories of the defects, which can assist us to better understand the defect evolution process. Additionally, various data mining algorithms can be applied to the databases. Mining spatial patterns within one simulation can aid in understanding the interactions among defects. Finding frequent patterns across multiple simulations can help to predict defect evolution. In this paper, we describe Phase 2 in detail. We also show some initial results for Phase 3.
4 Algorithm
Our previous work [11, 12, 14] describes the defect detection phase in detail. Every atom in the lattice is labeled either as a bulk atom or as a defect atom. However, upon further evaluation we found that this binary labeling is not well suited for robust classification of defect structures. Therefore, we propose to divide the atoms into three classes based on their membership or proximity to a defect. We also validate the correctness of this taxonomy by using a physical model. Our taxonomy is:

• Core-Bulk Atoms (CB): The atoms which conform to the set of rules defined by the unit cell are bulk or non-defect atoms. Bulk atoms which are connected exclusively to other bulk atoms are labeled as core bulk atoms.

• Core-Defect Atoms (CD): The atoms which do not conform to the set of rules are defect atoms. All the defect atoms which are connected to more defect atoms than bulk atoms are labeled as core defect atoms. These
Figure 2: Defect Detection and Classification Framework
atoms dominate the shape and properties of a defect.

• Interface Atoms (I): These atoms form the boundary between core bulk atoms and core defect atoms. These atoms fail to conform to the prescribed set of rules by a small margin (i.e., the thresholds for bond lengths and angles are violated marginally) and thus are marked as defect atoms; however, the majority of their nearest neighbors are core bulk. Thus, they form a ring between core bulk atoms and core defect atoms. The presence or absence of these atoms can considerably change the shape of a defect, which makes matching of defect structures very difficult.

Figure 4(a) illustrates all three types of atoms. The black atoms belong to the core defect, the white atoms form the core bulk, and the gray atoms are interface atoms. We next justify this taxonomy.
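The taxonomy above can be sketched in code as follows (a minimal illustration assuming the detection phase has already produced a per-atom defect flag and a first-neighbor list; the label "B" for non-core bulk atoms is our own placeholder, not from the paper):

```python
def label_atoms(is_defect, neighbors):
    """Assign each atom to CB (core bulk), CD (core defect), or I (interface).

    is_defect: per-atom booleans from the rule-based detection phase.
    neighbors: neighbors[i] lists the indices of atom i's first neighbors."""
    labels = []
    for i, defect in enumerate(is_defect):
        n_def = sum(1 for j in neighbors[i] if is_defect[j])
        n_bulk = len(neighbors[i]) - n_def
        if not defect:
            # bulk atom: core bulk only if connected exclusively to bulk atoms
            labels.append("CB" if n_def == 0 else "B")
        else:
            # defect atom: core defect if mostly defect neighbors, else interface
            labels.append("CD" if n_def > n_bulk else "I")
    return labels
```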
Physical Validation: A lattice system can be represented by the Mechanical Molecular Model as follows: atoms are represented by spheres, and bonds by springs connecting these spheres. The energy of an atom in the lattice system is calculated by the following equation:

E_total = E_length + E_angle + E_interactions

where E_length and E_angle are the energies due to bond stretching and angle bending respectively. Also, each atom in the lattice interacts with every other atom; E_interactions accounts for the energy generated by these interactions. However, we can ignore E_interactions for the MD datasets because OHMMS only considers the effect of the first neighbors of
Figure 3: (a) Original Defect (b) Detected Defect with extra atoms
an atom while solving the MD equations. The physics of springs can be used to calculate these energies. Hooke's law for springs states that the force exerted by a spring is proportional to the distance by which it is stretched. The energy of a spring can be derived by using the relationship between energy and force. The energy is calculated by the following equation:

E = (1/2) K δ²   (4.1)

where K is the spring constant and δ is the distance by which a spring is stretched from its uncompressed state. E_length and E_angle for each atom can be computed by using the appropriate spring constants, K_length and K_angle respectively. Information about the ideal bond length and bond angle present in the existing literature can be used
for the uncompressed spring state. Drexer [5] lists the values of K_length and K_angle as 185 Newton/meter and 0.35 Newton/radian respectively for the Silicon lattice. For each atom we find the bond lengths and bond angles it forms with its first neighbors and then find the δs. Core bulk atoms deviate very little from the ideal bond angle and bond length, therefore their corresponding δs should be very low; whereas for core defect atoms the δs should be high. Since the energy is directly proportional to δ², core bulk atoms should have low energy whereas core defect atoms should have high energy.
To validate our taxonomy, we sampled 1400 frames from different simulations and calculated the energy for each atom in the lattice. Figure 4(b) shows the distribution of energy. It is clear from the distribution that the majority of the atoms have very low energy (∈ [0, 0.2]). These atoms are core bulk atoms. The core defect atoms have very high energy (≥ 1.2). All the atoms which lie between the low and high energy levels are interface atoms. Thus, this physical model clearly validates our taxonomy of atoms. Therefore we refine our original binary labeling [11] of individual atoms by further dividing defect atoms into core defect atoms and interface atoms.
Before describing the classification method, we discuss the challenges which need to be addressed to build a robust and efficient classifier for MD data. We list each of them and describe how they are addressed within the context of the proposed algorithm. For each proposed solution we also provide the physical validation using the Mechanical Molecular Model and properties of the lattice system.
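The per-atom energy follows directly from Equation 4.1 summed over an atom's bond-length and bond-angle deviations; a small sketch (function and argument names are ours, constants as quoted from Drexer [5], units left abstract):

```python
def atom_energy(bond_deltas, angle_deltas, k_length=185.0, k_angle=0.35):
    """Hooke-type energy of one atom: sum of (1/2) K delta^2 over the
    deviations of its bonds and angles from their ideal values."""
    e_len = sum(0.5 * k_length * d**2 for d in bond_deltas)
    e_ang = sum(0.5 * k_angle * d**2 for d in angle_deltas)
    return e_len + e_ang
```

An atom with no deviations has zero energy; large deviations (core defect atoms) dominate the total.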
4.1 Challenges and Proposed Solutions
4.1.1 Thermal Noise: Thermal agitations can cause atoms to change their spatial positions. Such changes can potentially have two kinds of effect on the defect structures:

1. The precise locations of the atoms and their inter-pair distances will not be exactly the same from frame to frame. Thus the classification method should be tolerant to small deviations in the spatial positions.

2. The change in the spatial positions can also force a previously labeled bulk atom to violate the rules and be labeled as a defect atom (and vice versa) in the next time frame. Therefore the number of atoms in a defect can change, which in turn changes the overall shape of the defect structure, which makes the classification task more difficult.

To address the first problem we consider a data-driven approach to derive noise thresholds. From our study of the physics, we know that the change in position of each atom in two consecutive frames is influenced by the position and
number of its neighbors. To model this behavior we define a random variable D_i:

D_i = (1 / (F + 1)) · (M_i + Σ_{j=1..F} M_j)

where M_i is the displacement of atom i between two consecutive time steps, atom i having F first neighbors within a distance of 2.6 Å (bond length for Si), and i ∈ [1, N], N being the total number of atoms in the lattice. We empirically observe that D_i can be effectively approximated by a normal distribution with parameters µ_noise and σ_noise (the average mean and standard deviation of all D_i). We found µ_noise to be very close to 0 (which is expected because a given atom cannot move very far from its original location between two consecutive time frames). The parameter σ_noise is used to model the effect of noise in the defect classification algorithm. From a set of randomly selected 4500 frames, we found the value of σ_noise to be 0.19 Å.
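The estimation of µ_noise and σ_noise can be sketched as follows (assuming two aligned consecutive frames given as (N, 3) position arrays; the function name and layout are ours):

```python
import numpy as np

def estimate_noise(frame_a, frame_b, cutoff=2.6):
    """Estimate (mu_noise, sigma_noise) from the per-atom variables D_i,
    each averaging an atom's displacement with those of its F first
    neighbors (neighbors taken within `cutoff` angstroms in frame_a)."""
    disp = np.linalg.norm(frame_b - frame_a, axis=1)  # M_i for every atom
    d = np.empty(len(frame_a))
    for i, pos in enumerate(frame_a):
        dist = np.linalg.norm(frame_a - pos, axis=1)
        nbrs = np.where((dist > 0) & (dist <= cutoff))[0]  # F first neighbors
        d[i] = (disp[i] + disp[nbrs].sum()) / (len(nbrs) + 1)
    return d.mean(), d.std()
```

A rigid shift of the whole frame yields a mean displacement equal to the shift and zero spread, as expected.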
Physical Validation: The noise threshold can be validated by using the Mechanical Molecular Model. The bond energy B of a Si-Si bond is 52 Kcal/mol, which is the amount of energy needed to break a Si-Si bond. Using Equation 4.1 we can compute the maximum distance a Si atom can be displaced before the bond is broken. Essentially, we solve the following equation for the value of δ:

B ≤ (1/2) K δ²

By substituting the values of K and B we found the value of δ to be 0.2 Å, which implies that two bonded atoms cannot be moved more than 0.2 Å apart without breaking the bond between them. Thus, the empirically observed value is very close to the theoretical value given by the physical model.
To solve the second problem posed by thermal noise, we propose a weighting mechanism. The weighting mechanism is based on the following two observations:

• Observation 1: In two consecutive time frames the core defect atoms cannot change considerably.

• Observation 2: Interface atoms can make a transition from bulk to defect (and vice versa) very quickly.

Figure 3(a) and Figure 3(b) show the defect detected in two different frames after applying local operators. The defect in Figure 3(b) has extra atoms (interface) but the core defect (black atoms) remains unchanged. Therefore, a weighting mechanism is proposed to reduce the influence of interface atoms relative to that of core atoms within a defect structure. Essentially, the weight assigned to each atom in a given defect is proportional to the number of its first neighbors in the defect structure. Thus core defect atoms contribute more to defect classification than interface atoms. These weights are also used for handling translations
Figure 4: (a) Taxonomy of atoms (b) Energy Plot
(described below) and for computing the feature vector (weighted moments).
Physical Validation: Observation 1 can be explained as follows. Each atom in the lattice interacts most with its first neighbors. The greater the number of first neighbors of an atom, the more connected it is and hence the more restricted its movement. The core defect atoms have high connectivity with other defect atoms, which makes it more difficult for them to move very far in a short period of time. Observation 2 can be explained as follows. Interface atoms labeled as defect (or bulk) usually fail (or conform to) the set of rules by a small margin. A very small variation in their spatial locations can change their labels. These interface atoms, however, are very loosely connected to the core defect atoms. Most of their first neighbors are core bulk atoms. To summarize, over a period of time core defect atoms will change considerably less than interface atoms. Therefore more emphasis (weight) should be given to core defect atoms while matching two defect structures. This is precisely what our weighting mechanism does.
4.1.2 Translational and Rotational Invariance: Translations and rotations pose another problem in defect classification. The same defect can occur in different positions and orientations in the lattice. To classify a defect correctly, translations and rotations should be resolved before assigning the class. We next present our approach to attaining translational and rotational invariance.
Feature Vector Generation: We describe the shape of a defect by using statistical moments. We chose to use all first-, second- and third-order moments. Third-order moments capture skewness in defects. To account for the interface atoms we calculate weighted moments instead of simple moments. (Recall that the weighting mechanism assigns high weights to core defect atoms and low weights to interface atoms.) The feature vector comprising the weighted moments of a defect is calculated as:

W_mnp = (1 / Σ_{j=1..N} w_j) · Σ_{i=1..N} w_i · D_ix^m · D_iy^n · D_iz^p,  where m + n + p ≤ 3

where D_ix is the x-ordinate of the ith atom of defect D. An important property of this feature vector is that it is translation invariant if the weighted center of mass (given by the first three weighted moments) is translated to zero. Since all the rotations in the lattice are symmetry operations (see below for an explanation of symmetry operations), rotational ambiguity can be resolved easily by applying the appropriate permutation to the feature vector. For example, if the defect is rotated by 180 degrees across the X-plane (a mirror transform), all the moments involving an odd power of the X-component will change sign. In a similar fashion, all the rotations can be resolved by checking the pre-defined permutations of the original moments. There are a total of 3 first-order moments, 6 second-order moments and 10 third-order moments. Of these, since the center of mass is translated to the origin (to deal with translations), W_100, W_010 and W_001 are all zero. Therefore we have a 16-dimensional feature vector represented by D_w.
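The 16-dimensional feature vector can be computed as sketched below (assuming per-atom weights from the weighting mechanism of Section 4.1.1 are given; names are ours):

```python
import numpy as np
from itertools import product

def weighted_moments(coords, weights):
    """Return the 16 weighted central moments W_mnp with 2 <= m+n+p <= 3.

    coords: (N, 3) array of atom positions; weights: length-N array.
    The weighted center of mass is moved to the origin first, so the three
    first-order moments vanish and are omitted from the feature vector."""
    coords = np.asarray(coords, float)
    w = np.asarray(weights, float)
    centered = coords - np.average(coords, axis=0, weights=w)
    x, y, z = centered[:, 0], centered[:, 1], centered[:, 2]
    feats = []
    for m, n, p in product(range(4), repeat=3):
        if 2 <= m + n + p <= 3:  # 6 second-order + 10 third-order moments
            feats.append(np.sum(w * x**m * y**n * z**p) / w.sum())
    return np.array(feats)
```

Because of the mean-centering, translating every atom by the same offset leaves the feature vector unchanged.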
Physical Validation: Interface atoms change the shape of the defect, which can change the center of mass considerably.
Therefore, using the center of mass without the weights can assign a new class label to a defect even if the core defect is not new. Thus core defect atoms should contribute more towards the calculation of the center of mass. A lattice cannot be rotated in arbitrary directions. The only rotations possible in the lattice are those which carry the lattice onto itself. This means that after rotation each atom of the lattice is exactly at a position occupied by an atom prior to the rotation. These rotations are known as symmetry operations [9]. Under this constraint, only a finite number of rotations are possible in lattices. For example, in the Si lattice system only 24 types of rotations are possible.
4.1.3 Shape Based Classification: While matching two defect structures, the classifier should take into account the positions of all individual atoms in the defects. This atom-to-atom matching is relatively expensive. Furthermore, because of the large number of defect classes present in simulation datasets, it would be unrealistic to carry out such an atom-to-atom matching for all classes at each and every time step. Therefore a scheme is needed to effectively reduce the number of candidate classes on which an exhaustive atom-to-atom matching is performed. We address this challenge by adopting a two-step classification process. The first step uses weighted moments to find a smaller subset of defect classes to which the unlabeled defect can potentially belong and passes it to the next step. Weighted moments (the feature vector) are used because moments are known to capture the overall shape of an object [13]. The second step then finds the closest class by taking into account the positions of the atoms and their arrangement in three-dimensional space. In essence, both steps use information about the shape of a defect. The first step uses high-level information about the defect structure whereas the next step refines it by matching individual atoms. We achieve the desired efficiency because the first step is computationally very cheap and reduces the search space considerably for the next step. Experimental results to corroborate this are shown in Section 5.
Physical Validation: The majority of the physical properties of a defect are governed by its shape. Most of the stable defects seen so far have a very compact shape. Unstable structures tend to re-organize their atoms to form such a compact structure. The movement of the defect in the lattice is also governed by the shape of the defect.
4.1.4 Emergence of new defect classes: The underlying motivation of our effort is to discover information which can assist scientists to better understand the physics behind defect evolution, ideally in real time. This defect evolution can result in new defect classes which are not in the existing literature. This requires the classification process to be dynamic [6]. The classifier should be dynamic in the classical sense, in that new streaming data elements can be classified, but it should also be dynamic in the sense that new classes (defects), if discovered, can be added to the classifier model in real time. The new defect should be available when the next frame is processed. We next present our two-step classification process, which integrates all the proposed solutions to the above-mentioned challenges.
4.2 Two Step Classification Algorithm: Phase 1 of our framework detects the defect(s) in the lattice as mentioned in Section 3. Phase 2 classifies the defect(s) detected in Phase 1. Given a defect D, the goal is to find the type T of this defect. If D does not match any of the previously seen defects in the simulation, it is labeled as a new defect and stored in the databases ID_shape and ID_moment, where ID is a unique simulation identification number, ID_shape stores the actual three-dimensional co-ordinates, and ID_moment stores the weighted central moments (feature vector) of the defect structure. These databases store all the unique defects detected in the current simulation. The label of a new defect is of the form defect_i_j, indicating that the new defect is the jth defect in the ith frame of the simulation. If D is not new, then a pointer to the defect class which closely matches D is stored. Besides these two databases, a summary file is generated which stores the names of all detected defects in the simulation along with the corresponding frame numbers. We now proceed to describe the two steps of our classifier in detail.
4.2.1 Step 1 - Feature Vector based Pruning: We use a variant of the KNN classifier for this task. The value of K is not fixed: instead, it is determined dynamically for each defect. Given the feature vector (D_W) of a defect D, we compute and sort the distances between D_W and ID_Mi, where ID_Mi is the mean moment vector of the ith defect class in ID_moment. All classes having distances less than an empirically-derived threshold are chosen as candidate classes. Step 2, then, works on these K classes only. If no class can be selected, D is considered as a new defect. The databases ID_shape and ID_moment are updated immediately, so that D is available when the next frame is processed. In a similar fashion, one can use Naive Bayes and voting-based classifiers. Like the KNN classifier, these classifiers also provide metrics which can be used to select the top K candidate classes. More specifically, a Naive Bayes classifier provides the probabilities of a feature vector belonging to each class, and a voting-based classifier gives the number of votes for each class. The top K classes can then be chosen based on probabilities and votes. We chose VFI as our voting-based classifier. As for other types of classifiers, such as the decision tree-based ones, it is not trivial to pick K candidate
classes, therefore they are not considered in this work. From the three applicable classifiers, the KNN classifier is chosen because it gives the highest classification accuracy, as described in Section 5. Besides its high accuracy, the KNN classifier is incremental in nature. In other words, there is no need to re-build the classification model from scratch if a new class is discovered. In contrast, Naive Bayes and VFI would require the classification model to be re-built every time a new class is discovered. The K candidate classes are passed to Step 2. The representative shapes of these K classes are matched using an exact shape matching algorithm based on the Largest Common Substructure (LCS). Next, we explain the main steps of our exact matching approach.
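The dynamic-K pruning can be sketched as follows (a simplified version; the data layout and threshold handling are our assumptions):

```python
import numpy as np

def prune_classes(feature, class_means, threshold):
    """Return indices of candidate classes, sorted by distance, whose mean
    moment vector lies within `threshold` of the defect's feature vector.
    An empty result means the defect is treated as a new class."""
    dists = np.linalg.norm(class_means - feature, axis=1)
    order = np.argsort(dists)
    return [i for i in order if dists[i] < threshold]
```

K here is simply the number of classes that pass the threshold, so it varies from defect to defect.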
4.2.2 Step 2 - Largest Common Substructure based algorithm: Assume A is a defect of unknown type and B is the median defect representing one of the candidate classes from Step 1. The defects are mean centered and the rotation is resolved. We next describe all the steps of the LCS algorithm in detail.

• Atom Pairs Formation: The defects are sorted w.r.t. their x-ordinate. Two atoms i and j in defect A form an atom pair A_ij if distance(A_i, A_j) ≤ bond length. This step uses the information about neighbors and connectivity calculated in Phase 1. These atom pairs are calculated for both defects. For each atom pair A_ij, we store its projections onto the X, Y and Z axes, represented by A_ijx, A_ijy and A_ijz respectively.

• Find Matching Pairs: For each pair A_ij we find all pairs B_kl such that

|A_ijx − B_klx| ≤ σ_noise
|A_ijy − B_kly| ≤ σ_noise
|A_ijz − B_klz| ≤ σ_noise

where the threshold σ_noise is obtained as explained in Section 4.1.1.

We represent this equality of atom pairs as A_ij ↔ B_kl, which implies that the length and orientation of the bond formed by atoms i and j of defect A are similar to those of the bond formed by atoms k and l of defect B. By comparing each projection separately, we intrinsically take care of both bond length and orientation.
• Find Largest Common Substructure (LCS): The rules generated in the previous step are used to find the largest common substructure between the two defects. We use a region growing based approach to find the LCS.
The pseudo code for finding the LCS is shown in Figure 5. Before explaining each step in detail, we define the notion of compatible substructures:
Two substructures U and V are considered compatible w.r.t. the rule Aij ↔ Bkl if the last atom added to U is the ith atom of defect A and the last atom added to V is the kth atom of defect B.
Being compatible implies that the two substructures have the same number of atoms and that the orientation of the atoms (which defines shape) is approximately the same (within noise thresholds).
The algorithm starts by finding all compatible substructures U and V w.r.t. the rule Aij ↔ Bkl (Line 4). The length of U (and V) is increased by 1 and atom j (and l) is added. Lines 5-10 of Figure 5 show this process. However, if no compatible substructures are found, then a new substructure U (and V) is initialized with atoms i and j (k and l). Lines 11-16 in Figure 5 refer to this case. The same process is then repeated for all the rules.
 1  Input: All rules
 2  For each rule: Aij ↔ Bkl
 3  {
 4      Find compatible substructures U and V
 5      If U and V found
 6      {
 7          Length = Length + 1;
 8          U[Length] = j;
 9          V[Length] = l;
10      }
11      else
12      {
13          Create new U and V;
14          Store i and j in U;
15          Store k and l in V;
16      }
17  }

Figure 5: Pseudo code for finding the Largest Common Substructure
This method also provides the correspondence between atoms: atoms in U and V have a one-to-one relationship between them.
• Similarity Metric Computation: The Largest Structure (LS) is then chosen from the common substructures. We use the following metric to determine the similarity between A and B:

  Sim(A, B) = 2 · ‖LS‖ / (‖A‖ + ‖B‖)
This similarity is calculated between A and all the K candidate defect classes. The class which gives the maximum similarity greater than a user defined threshold is chosen as the target class. If the maximum similarity is less than the user defined threshold, the defect is considered new and both databases, IDshape and IDmoment, are updated. The summary database is updated for each defect (previously seen or new).
5 Experiments and Results
In this section we present the results of our framework. Asnoted
earlier we use OHMMS (see Section2) to generate thedatasets. We
first, show the advantage of weighted momentsover unweighted
moments by comparing the accuracies ofvarious classifiers. Next, we
demonstrate the accuracy of theLCS algorithm bootstrapped with
different classifiers:KNN,Naive Bayes and VFI. Later, we show the
scalable aspectsof our framework by deploying it on very large
datasets (inthe giga-byte range). Finally, we present preliminary
resultsdemonstrating how our two-step classifier can help us gain
abetter understanding of defect evolution.
5.1 Robust Classification: To illustrate the importance of using weighted moments as opposed to unweighted moments, we performed the following experiment: a total of 1,400 defects were randomly sampled across multiple simulations conducted at different temperatures. The noise in the simulation depends on the temperature at which the lattice is simulated. Therefore, two defects belonging to the same class can have different numbers of atoms and/or different positions of atoms depending on the temperature, even though their core defect shape remains approximately the same. This sampling strategy ensures that no two defects of the same class are exactly the same. Each defect in this experiment belongs to one of the fourteen classes of single interstitial defects that are known to arise in Si.
For comparison purposes, we tried nine different classifiers. Figure 6 clearly demonstrates that all classifiers perform better when weighted moments are used. Classification accuracies of VFI, KNN (K=1) and decision tree based classifiers are comparable (close to 90%). SMO (an SVM based classifier) also provided good accuracy (85%) but was quite slow; classifying 1,400 defects took over 25 minutes. On average, the classification accuracy increased by 8% when weighted moments were used.
Next, we present the classification accuracies of Naive Bayes, KNN and VFI. These classifiers are modified to pick the K most important classes dynamically (as explained in Section 4). Figure 7 shows the results for this experiment. KNN with weighted moments outperforms all other classifiers by achieving an accuracy of 99%, whereas Naive Bayes is the least accurate with an accuracy of 86%. Again, weighted moments outperform unweighted moments.
Figure 6: Accuracies of various classifiers (Naive Bayes, LWL, Hyperpipe, VFI, Decision tree, OneR, SMO, JRip, KNN with K=1), unweighted vs. weighted moments; y-axis: accuracy (%)
An important point to note is that all the 1,400 defects used for this experiment were labeled manually by a domain expert. However, in actual simulation data there are no predetermined labels, since new classes can be created as the simulation progresses. Also, there is no training data to build the initial model for the Decision tree and Naive Bayes classifiers. For the purpose of this experiment, we artificially divided the dataset into training (90%) and testing (10%) data for all the classifiers that require training data to build a model. Classifier accuracies are averaged over 10 runs of the classifiers.
Only KNN and VFI can discover new classes in real time. Both classifiers calculate a similarity metric for classification: distance in the case of KNN and votes in the case of VFI. If this similarity metric is less than a user defined threshold, a new class label can be assigned to the defect. However, VFI will have to rebuild the whole classification model from scratch whenever a new defect class is discovered. Since a large number of defect classes can be created in a simulation, rebuilding the classification model repeatedly will degrade the performance considerably.
Figure 7: Classification accuracy with dynamic K (KNN, Naive Bayes, VFI), unweighted vs. weighted moments; y-axis: accuracy (%)
Figure 8: Transitions in Three Interstitials: (a) 1st frame, (b) 20,000th frame, (c) 130,000th frame
Thus the LCS algorithm bootstrapped with the KNN classifier using weighted central moments is the best choice in terms of accuracy and efficiency.
5.2 Scalable Classification - Large Simulations: We use three large datasets, namely Two Interstitials, Three Interstitials, and Four Interstitials, for these experiments. Table 1 summarizes the number of frames, the size of each dataset, the total number of defects present in the simulation, and the number of unique defect classes identified by our framework. For all three datasets, our framework was able to correctly identify all the defect structures. However, given the paucity of space, we only present an in-depth analysis of the Three Interstitials dataset. Similar results were also obtained for the other datasets.
Dataset               Number of Frames   Size (in GB)   Total Defects Detected   Unique Defects
Two Interstitials     512,000            4              350,000                  2,841
Three Interstitials   200,200            6              320,000                  1,543
Four Interstitials    297,000            10             410,000                  3,261

Table 1: Datasets Used in Evaluation
This simulation starts with three disconnected interstitial defects. The defects move around in the lattice during the first 19,000 time frames. However, at the 20,000th time frame two of the defects join and form a 'new' larger defect. This larger defect does not change for a long period of time. However, at the 130,000th time frame the third defect joins the 'new' defect and forms a single large defect which remains unchanged until the end of the simulation. Figure 8 shows the evolution of the defects in the simulation. For the rest of this paper we refer to changes in defect shape or type as "transitions".
Though these transitions occur over a large period (thousands of time steps), atoms do not stay at the exact same position in two consecutive frames due to thermal noise. Such thermal agitations can also cause bulk atoms to be labeled as defect atoms (and vice versa). As a result, there exist marginal fluctuations in the shape of a defect from frame to frame. However, the effect of these changes on weighted central moments is relatively small. For example, in the Three Interstitials dataset, the total number of defect instantiations in the simulation was around 320,000, yet our classifier detected only 1,543 unique defect classes. These 1,543 defects capture the actual transitions, as verified by a domain expert. To reiterate, the use of weighted moments minimizes false positives and ensures robust classification.
The use of weighted moments and pruning in Step 1 also allows our approach to achieve good scalability. Finding the LCS is relatively expensive, so we want to invoke it as infrequently as possible. In most cases the number of candidate classes K from Step 1 (the KNN classifier) of our dynamic classifier is less than 3. For example, in the Two Interstitials dataset 2,841 unique defects were found; however, the LCS algorithm only evaluates fewer than 3 closest matches. This underlines the usefulness of the pruning step of our classifier. The discovery of all the unique defect classes demonstrates that the correct defect class is not pruned away. To summarize, pruning based on weighted moments provides scalability to the framework without affecting the accuracy.
Many of these defects are not stable, i.e., they may exist for as few as 100 time frames; however, these unstable defect structures are extremely important since they allow one to understand the physics behind the creation of, and transitioning to, stable structures. We could easily eliminate these unstable structures from our repositories by either maintaining simple counts or by time averaging the frames. However, both of these techniques result in a loss of transition information. To illustrate this point, we took the same Three Interstitials dataset and averaged it over every 128 frames. In this averaged data, we found only 18 unique defects. It turns out that we found all the possible stable structures, but the actual transitioning behavior was lost.
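The time-averaging experiment above can be sketched as follows (an illustrative NumPy implementation of windowed averaging over atom positions; the function name and array layout are our assumptions):

```python
import numpy as np

def time_average(frames, window=128):
    """Average atom positions over non-overlapping windows of `window` frames.
    frames: (T, N, 3) array of N atom positions over T time frames.
    Averaging smooths thermal noise, but it also blurs away defect
    structures that live for fewer than `window` frames."""
    frames = np.asarray(frames, dtype=float)
    t = (len(frames) // window) * window          # drop any trailing partial window
    return frames[:t].reshape(-1, window, *frames.shape[1:]).mean(axis=1)
```

Any defect that appears and disappears inside a single window is averaged into its surroundings, which is exactly why only the 18 stable structures survive while the transitions do not.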
5.2.1 Timing Results: Figure 9 shows the time taken by OHMMS to complete the simulation and the time taken by our framework to analyze the data. The figure also shows the individual time taken by Phase 1 (defect detection) and Phase 2 (classification). Phase 1 takes around 45% of the time and Phase 2 requires the rest. All the experiments were carried out on a Pentium 4 2.8GHz dual processor machine with 1 GB of main memory. Our classifier can analyze the data almost 1.5 times faster than the data generation rate. This allows us to analyze the data and build the defect databases in real time without dropping/losing any frames. Another advantage is that we are not required to store the large simulation file (of the order of 15GB) on disk. All the needed information about defect type(s), number(s) and transitions can be obtained from the simulation databases and the summary file.
Figure 9: Timing results for the Two, Three and Four Interstitial datasets: OHMMS simulation time vs. framework analysis time, with the Phase 1 and Phase 2 breakdown; y-axis: time (in hours)
Next we show how the results produced by the framework can be used for tracking and understanding the movement of defects in the simulation.
5.3 Meta-stable Transitions: The transitions between two meta-stable defect structures are even more important than the creation of transient structures. We next present experimental results on a simulation that depicts the transition of a single defect to another defect.
This is a fairly small but extremely useful simulation. The simulation has 1,300 frames with 67 atoms in each frame and one interstitial defect. We were able to detect the 50 unique defects which actually capture the transitions from a defect of type I3us-01 to I3us-03. (These labels are provided by domain experts.) The defect does not break into multiple parts; therefore, we do not have to deal with the correspondence problem in this case¹. Again, all these results have been verified by our domain expert by manually checking every frame of the simulation.
¹Correspondence allows the labeling of two defects with the same class label at two different time epochs.
5.4 Generating defect trajectories: From the summary database produced at the end of simulation analysis, we can glean important information about the movement of a defect in lattices. The summary database provides the information needed to construct a defect's motion trajectory over a period of time. We use a 10,000 frame simulation to show this. In this simulation the defect moves in the −z direction through the lattice, reaches the end of the lattice and then stays in the xy plane. We found 70 unique defects in this simulation, and all the detected defects are labeled as one of these 70 classes. Most of these defects were highly unstable. We plot the (x, y, z) coordinates of each detected defect's weighted centroid at each time stamp. Figure 10 clearly shows the movement of the defect in the −z direction. This idea can be extended to multiple defect simulations: since the defects in the summary database are labeled, it should be fairly easy to construct multiple trajectories for multiple defects. By studying these labeled trajectories, one can gain more insight into how a defect evolves and interacts with other defects over time.
6 Conclusions
In this application case study, we propose a two-step
clas-sifier to classify the defects in large scale MD
simulationdatasets. The classifier is scalable and incremental in
nature.New classes of defects can be discovered and added to
clas-sifier model in real time. The approach is also robust to
noise(inherent to MD simulations). We present various noise
han-dling schemes and validate these schemes using a physicalmodel
and properties of the lattice systems. We demonstratethe
capabilities of our approach by deploying it on very largedatasets
(≥ 4GB). We were able to find a very small num-ber of unique defect
classes from these large datasets. Theseunique classes capture the
defect transitions very well.We are currently working on solving
the correspondenceproblem in the context of multiple defects. This
will enableus to build an automated system to capture important
eventssuch asdefect disintegrationanddefect amalgamation. An-other
future goal is to understand the interactions among de-fects in a
simulation. Towards this goal, we plan to model themovement of
defect as trajectories, tagged by defect classlabels, and analyze
these trajectories. We also plan to ap-ply other data mining
techniques including frequent itemsetmining and spatial patterns
mining to gain more insight inthe actual defect evolution
process.
References
[1] C. Borgelt and M. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In ICDM, 2002.
Figure 10: Capturing the movement of a defect
[2] S.J. Clark and G.J. Ackland. Ab initio calculation of the self interstitial in silicon. Physical Review Letters, vol. 56, 1997.
[3] A.S. Clarke and H. Jónsson. Structural changes accompanying densification of random hard-sphere packings. Physical Review E, vol. 47, pages 3975–3984, 1993.
[4] D.A. Richie, Jeongnim Kim, and John W. Wilkins. Real-time multiresolution analysis for accelerated molecular dynamics simulations. In American Physical Society March Meeting, 2001.
[5] K. Eric Drexler. Nanosystems: Molecular Machinery, Manufacturing, and Computation. Wiley Publishers, 1992.
[6] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Knowledge and Data Discovery (SIGKDD), 2001.
[7] J.M. Galvez and M. Canton. Normalization and shape recognition of three-dimensional objects by 3D moments. PR, 26:667–681, 1993.
[8] H. Jónsson and H.C. Andersen. Icosahedral ordering in the Lennard-Jones liquid and glass. Physical Review Letters, vol. 60, pages 2295–2298, 1988.
[9] Charles Kittel. Introduction to Solid State Physics. John Wiley and Sons, 1971.
[10] L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In Knowledge Discovery and Data Mining, 1998.
[11] M. Jiang, T.-S. Choy, S. Mehta, M. Coatney, S. Barr, K. Hazzard, D. Richie, S. Parthasarathy, R. Machiraju, D. Thompson, J. Wilkins, and B. Gatlin. Feature mining algorithms for scientific data. In SIAM, 2003.
[12] Sameep Mehta, Kaden Hazzard, Raghu Machiraju, Srinivasan Parthasarathy, and John Wilkins. Detection and visualization of anomalous structures in molecular dynamics simulation data. In IEEE Conference on Visualization, 2004.
[13] M. Hu. Visual pattern recognition by moment invariants. In IRE Trans. Information Theory, pages 179–187.
[14] R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, and S. Barr. Mining of complex evolutionary phenomena, next generation data mining. In NGDM, 2003.
[15] R. Nussinov and H. Wolfson. Efficient detection of three dimensional structural motifs in biological macromolecules by computer vision techniques. In Proceedings of the National Academy of Sciences of the United States of America, volume 88, Dec 1, 1991.
[16] S. Djoko, D. Cook, and L. Holder. Analyzing the benefits of domain knowledge in substructure discovery. In Knowledge Discovery and Data Mining, 1995.
[17] S. Parthasarathy and M. Coatney. Efficient discovery of common substructures in macromolecules. In ICDM, 2002.
[18] R. Veltkamp and M. Hagedoorn. State-of-the-art in shape matching. Technical Report UU-CS-1999-27, Utrecht University, the Netherlands, 1999.
[19] W.K. Leung, R.J. Needs, and G. Rajagopal. Calculation of silicon self interstitial defects. Physical Review Letters, vol. 83, 1999.
[20] X. Wang, J. Wang, D. Shasha, B. Shapiro, S. Dikshitulu, I. Rigoutsos, and K. Zhang. Automated discovery of active motifs in three dimensional molecules. In Knowledge Discovery and Data Mining, 1997.
[21] Y. Lamdan and H. Wolfson. Geometric hashing: a general and efficient model-based recognition scheme. In Proceedings of the second ICCV, pages 238–289, 1988.
[22] C. Zhang and T. Chen. Efficient feature extraction for 2D/3D objects in mesh representation. In ICIP, 2001.