Dynamic Classification of Defect Structures in Molecular Dynamics Simulation Data

S. Mehta∗, S. Barr†, T. Choy†, H. Yang∗, S. Parthasarathy∗, R. Machiraju∗, J. Wilkins†

∗Department of Computer Science and Engineering, Ohio State University
†Department of Physics, Ohio State University
Contact: {mehtas,srini,raghu}@cse.ohio-state.edu

January 2, 2005
Abstract

In this application paper we explore techniques to classify anomalous structures (defects) in data generated from ab-initio Molecular Dynamics (MD) simulations of Silicon (Si) atom systems. These systems are studied to understand the processes behind the formation of various defects, as defects have a profound impact on the electrical and mechanical properties of Silicon. In our prior work we presented techniques for defect detection [11, 12, 14]. Here, we present a two-step dynamic classifier to classify the defects. The first step uses up to third-order shape moments to provide a smaller set of candidate defect classes. The second step assigns the correct class to the defect structure by considering the actual spatial positions of the individual atoms. The dynamic classifier is robust and scalable in the size of the atom systems. Each phase is immune to noise, which is characterized after a study of the simulation data. We also validate the proposed solutions by using a physical model and properties of lattices. We demonstrate the efficacy and correctness of our approach on several large datasets. Our approach is able to recognize previously seen defects and also identify new defects in real time.
1 Introduction
Traditionally, the focus in the computational sciences has been on developing algorithms, implementations, and enabling tools to facilitate large-scale realistic simulations of physical processes and phenomena. However, as simulations become more detailed and realistic, and implementations more efficient, scientists are finding that analyzing the data produced by these simulations is a non-trivial task. Dataset size, providing reasonable response time, and modeling the underlying scientific phenomenon during the analysis are some of the critical challenges that need to be addressed.
In this paper we present a framework that addresses these challenges for mining datasets produced by Molecular Dynamics (MD) simulations to study the evolution of defect structures in materials. As component size decreases, a defect - any deviation from the perfectly ordered crystal lattice - in a semiconductor assumes ever greater significance. These defects are often created by introducing extra atom(s) in the Silicon lattice during ion implantation for device fabrication. Such defects can aggregate to form larger extended defects, which can significantly affect device performance in an undesirable fashion.
Simulating defect dynamics can potentially help scientists understand how defects evolve over time and how aggregated/extended defects are formed. Some of these defects are stable over a period of time while others are short-lived. Efficient, automated or semi-automated analysis techniques can help simplify the task of wading through a pool of data and help quickly identify important rules governing defect evolution, interactions and aggregation. The key challenges are: i) to detect defects; ii) to characterize and classify both new and previously seen defects accurately; iii) to capture the evolution and transitioning behavior of defects; and iv) to identify the rules that govern defect interactions and aggregation. Manual analysis of these simulations is a very cumbersome process. It takes a domain expert more than eight hours to manually analyze a very small simulation of 8000 time frames. Therefore, a systemic challenge is to develop an automated, scalable and incremental algorithmic framework so that the proposed techniques can support in-vivo analysis in real time.
In our previous work [11, 12, 14], we presented algorithms to address the first challenge. Here we address the second challenge coupled with the systemic challenge outlined above. The design tenets include not only accuracy and execution time but also both statistical and physical validation of the proposed models. We also present preliminary results
to show that our approach can aid in handling the third challenge. The main contributions of our application case study paper are:
1. We develop a two-step incremental classifier that classifies both existing and new defects (generates a new class label).

2. We validate each step of our 2-step classifier theoretically, relying on both physical and statistical models.

3. We validate our approach on large (greater than 4GB) real MD simulation datasets and demonstrate both the exceptional accuracy and efficiency of the proposed framework.

4. We present initial results which show that our approach can be used to capture defect evolution and to generate labeled defect trajectories.
Our paper is structured in the following manner. Section 2 discusses the basic terminology of MD and related work. An overview of our framework is provided in Section 3. We present our algorithm in Section 4. Results on large simulation datasets are presented in Section 5. Finally, we conclude and discuss directions for future work in Section 6.
2 Background and Related Work
2.1 Background: In this section, we first define basic terms that are used throughout this article. Later, we describe pertinent related work. A lattice is an arrangement of points or particles or objects in a regular periodic pattern in three dimensions. The elemental structure that is replicated in a lattice is known as a unit cell. Now, consider adding a single atom to the lattice. This extra atom disturbs the geometric structure of the lattice. This disturbance, comprised of atoms which deviate from the regular geometry of the lattice, is referred to as the defect structure. Defects created by adding an extra atom are known as single-interstitial defects. Similarly, one can define di- and tri-interstitial defects by adding two or three single-interstitial defects in a lattice, respectively. Figure 1(a) shows a Si bulk lattice with a certain unit cell shaded differently (black). Figure 1(b) shows another lattice with a single interstitial defect. Figure 1(c) depicts two interstitials: in the lower left and upper right corners respectively of a 512-atom lattice. The different shades again represent separate and distinct defects.
We use the Object-Oriented High Performance Multiscale Materials Simulator (OHMMS) that some of us developed (primarily led by co-author Wilkins) [4] as our workhorse. The equation of motion as described by Newton's second law is used to determine atom locations. While the exact forces must be derived from quantum mechanical studies and computations, these classical equations serve as a suitable approximation.
2.2 Related Work: Traditionally, physicists have used ground energy and electrostatic potential to find defects in a lattice. For example, ab-initio methods are used to locate interstitial defects in a Si lattice [2, 19]. These methods exploit anomalies in the energy/potential fields available at all points in the lattice. However, the calculation and analysis of these energies/potentials is very time consuming. The most pertinent work that is closely related to our own employs the method of Common Neighbor Analysis (CNA) [3, 8]. CNA attempts to glean crystallization structure in lattices and uses the number of neighbors of each atom to determine the structure sought. However, it should be noted that the distribution of neighbors alone cannot characterize the defects, especially at high temperatures. Related to our work is the large body of work in biomolecular structure analysis [1, 10, 16, 20]. In these techniques, the data is often abstracted as graphs and transactions and subsequently mined. However, such an abstraction often does not exploit and explain many of the inherent spatial and dynamical properties of the data that we are interested in. Moreover, while some of these techniques [17, 20] deal with noise in the data, their noise handling capabilities are limited to smoothing out uncorrelated and small changes in the spatial locations of atoms. Within the context of MD data, noise can also change the number of defect atoms detected for essentially the same defect structure at different time frames. None of the methods in the biological data mining literature deal with this uncertainty. Matching of two structures has drawn a lot of attention in the recent past. Zhang et al. [22] propose a protein matching algorithm that is rotation and translation invariant. This method relies on the shape of the point cloud, and it works well for proteins given the relatively large number of atoms: the presence of a few extra atoms does not change the shape of the point cloud, so a potential match is not stymied by them. However, in MD simulation data, we are interested in anomalous structures which can be as small as just six atoms. Extra atoms that may be included in a defect given the thermal noise will skew a match significantly even if the two defects differ by one atom. Geometric hashing, an approach that was originally developed in the robotics and vision communities [21], has found favor in the biomolecular structure analysis community [15, 20]. Rotations and translations are well handled in this approach. The main drawback of geometric hashing is that it is very expensive because an exhaustive search for the optimal basis vector is required. A more detailed discussion of various shape matching algorithms can be found in the survey paper by Veltkamp and Hagedoorn [18]. We use statistical moments to represent the shape of a defect for initial pruning. The seminal work by Hu [13] described a set of seven moments which
Figure 1: (a) Original lattice with unit cell marked (b) Lattice with one interstitial defect (c) Lattice with two interstitial defects
capture many of the features of two-dimensional binary images. In recent work, Galvez [7] proposed an algorithm which uses shape moments to characterize 3D structures via the moments of their two-dimensional contours. However, owing to the relatively small number of atoms present in a typical defect, the contours or the corresponding implicit 3D surface are impossible to obtain with accuracy.
3 Dynamic Classification Framework
Figure 2 shows our framework for MD simulation data analysis. The framework is divided into 3 phases.
Phase 1 - Defect Detection: this phase detects and spatially clusters (or segments) defects, and handles periodic boundary conditions. A detailed explanation of this phase is given in our earlier work [11, 12, 14].
Phase 2 - Dynamic Classification: this phase classifies each defect found in Phase 1. This phase consists of three major components:

1. Generating a feature vector for each detected defect. This feature vector is composed of weighted statistical moments.

2. Pruning the defect classes based on the feature vector. This step provides a smaller subset of defect classes to which an unlabeled defect can potentially belong.

3. An exact matching algorithm that assigns the correct class label to the defect. This step takes into account the spatial positions of the atoms. The defect is assigned a class label if it matches any of the previously seen defects; otherwise the defect is considered new.

Phase 2 maintains 3 databases for all detected defects. Section 4 gives detailed information regarding these three databases and their update strategies. The framework is made robust by modeling noise in both Phase 1 and Phase 2. Our noise characterization models the aggregate movement of the defect structure and the arrangement of the atoms (in terms of neighboring bonds). Our framework can be deployed to operate in a streaming fashion. This is important since it enables us to naturally handle data in a continuous fashion while a simulation is in progress. Phase 1 handles the entire frame and detects all the defects. Each defect is then pipelined into Phase 2. Thus, we are able to incrementally detect and classify defects while consistently updating the database in real time.
Phase 3 - Knowledge Mining: this phase uses the databases generated by Phase 2. These databases store the information about all the defects in a given simulation. They can be used to track and generate the trajectories of the defects, which can assist us to better understand the defect evolution process. Additionally, various data mining algorithms can be applied to the databases. Mining spatial patterns within one simulation can aid in understanding the interactions among defects. Finding frequent patterns across multiple simulations can help to predict defect evolution. In this paper, we describe Phase 2 in detail. We also show some initial results for Phase 3.
4 Algorithm
Our previous work [11, 12, 14] describes the defect detection phase in detail. Every atom in the lattice is labeled either as a bulk atom or as a defect atom. However, upon further evaluation we found that this binary labeling is not well suited for robust classification of defect structures. Therefore, we propose to divide the atoms into three classes based on their membership or proximity to a defect. We also validate the correctness of this taxonomy by using a physical model. Our taxonomy is:

• Core-Bulk Atoms (CB): The atoms which conform to the set of rules defined by the unit cell are bulk or non-defect atoms. Bulk atoms which are connected exclusively to other bulk atoms are labeled as core bulk atoms.

• Core-Defect Atoms (CD): The atoms which do not conform to the set of rules are defect atoms. All the defect atoms which are connected to more defect atoms than bulk atoms are labeled as core defect atoms. These
Figure 2: Defect Detection and Classification Framework
atoms dominate the shape and properties of a defect.

• Interface Atoms (I): These atoms form the boundary between core bulk atoms and core defect atoms. These atoms fail to conform to the prescribed set of rules by a small margin (i.e., the thresholds for bond lengths and angles are violated marginally) and thus are marked as defect atoms; however, the majority of their nearest neighbors are core bulk. Thus, they form a ring between core bulk atoms and core defect atoms. The presence or absence of these atoms can considerably change the shape of a defect, which makes matching of defect structures very difficult.

Figure 4(a) illustrates all three types of atoms. The black atoms belong to the core defect, the white atoms form the core bulk, and the gray atoms are interface atoms. We next justify this taxonomy.
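The taxonomy above can be sketched in code as follows (a minimal illustration assuming the detection phase has already produced a per-atom defect flag and a first-neighbor list; the label "B" for non-core bulk atoms is our own placeholder, not from the paper):

```python
def label_atoms(is_defect, neighbors):
    """Assign each atom to CB (core bulk), CD (core defect), or I (interface).

    is_defect: per-atom booleans from the rule-based detection phase.
    neighbors: neighbors[i] lists the indices of atom i's first neighbors."""
    labels = []
    for i, defect in enumerate(is_defect):
        n_def = sum(1 for j in neighbors[i] if is_defect[j])
        n_bulk = len(neighbors[i]) - n_def
        if not defect:
            # bulk atom: core bulk only if connected exclusively to bulk atoms
            labels.append("CB" if n_def == 0 else "B")
        else:
            # defect atom: core defect if mostly defect neighbors, else interface
            labels.append("CD" if n_def > n_bulk else "I")
    return labels
```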
Physical Validation: A lattice system can be represented by the Mechanical Molecular Model as follows: atoms are represented by spheres, and bonds by springs connecting these spheres. The energy of an atom in the lattice system is calculated by the following equation:

E_total = E_length + E_angle + E_interactions

where E_length and E_angle are the energies due to bond stretching and angle bending respectively. Also, each atom in the lattice interacts with every other atom; E_interactions accounts for the energy generated by these interactions. However, we can ignore E_interactions for the MD datasets because OHMMS only considers the effect of the first neighbors of
Figure 3: (a) Original Defect (b) Detected Defect with extra atoms
an atom while solving the MD equations. The physics of springs can be used to calculate these energies. Hooke's law for springs states that the force exerted by a spring is proportional to the distance by which it is stretched. The energy of a spring can be derived by using the relationship between energy and force. The energy is calculated by the following equation:

E = (1/2) K δ²   (4.1)

where K is the spring constant and δ is the distance by which a spring is stretched from its uncompressed state. E_length and E_angle for each atom can be computed by using the appropriate spring constants, K_length and K_angle respectively. Information about the ideal bond length and bond angle present in the existing literature can be used
for the uncompressed spring state. Drexer [5] lists the values of K_length and K_angle as 185 Newton/meter and 0.35 Newton/radian respectively for the Silicon lattice. For each atom we find the bond lengths and bond angles it forms with its first neighbors and then find the δs. Core bulk atoms deviate very little from the ideal bond angle and bond length, therefore their corresponding δs should be very low; whereas for core defect atoms the δs should be high. Since the energy is directly proportional to δ², core bulk atoms should have low energy whereas core defect atoms should have high energy.
To validate our taxonomy, we sampled 1400 frames from different simulations and calculated the energy for each atom in the lattice. Figure 4(b) shows the distribution of energy. It is clear from the distribution that the majority of the atoms have very low energy (∈ [0, 0.2]). These atoms are core bulk atoms. The core defect atoms have very high energy (≥ 1.2). All the atoms which lie between the low and high energy levels are interface atoms. Thus, this physical model clearly validates our taxonomy of atoms. Therefore we refine our original binary labeling [11] of individual atoms by further dividing defect atoms into core defect atoms and interface atoms.
Before describing the classification method, we discuss the challenges which need to be addressed to build a robust and efficient classifier for MD data. We list each of them and describe how they are addressed within the context of the proposed algorithm. For each proposed solution we also provide the physical validation using the Mechanical Molecular Model and properties of the lattice system.
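The per-atom energy follows directly from Equation 4.1 summed over an atom's bond-length and bond-angle deviations; a small sketch (function and argument names are ours, constants as quoted from Drexer [5], units left abstract):

```python
def atom_energy(bond_deltas, angle_deltas, k_length=185.0, k_angle=0.35):
    """Hooke-type energy of one atom: sum of (1/2) K delta^2 over the
    deviations of its bonds and angles from their ideal values."""
    e_len = sum(0.5 * k_length * d**2 for d in bond_deltas)
    e_ang = sum(0.5 * k_angle * d**2 for d in angle_deltas)
    return e_len + e_ang
```

An atom with no deviations has zero energy; large deviations (core defect atoms) dominate the total.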
4.1 Challenges and Proposed Solutions
4.1.1 Thermal Noise: Thermal agitations can cause atoms to change their spatial positions. Such changes can potentially have two kinds of effect on the defect structures:

1. The precise locations of the atoms and their inter-pair distances will not be exactly the same from frame to frame. Thus the classification method should be tolerant to small deviations in the spatial positions.

2. The change in the spatial positions can also force a previously labeled bulk atom to violate the rules and be labeled as a defect atom (and vice versa) in the next time frame. Therefore the number of atoms in a defect can change, which in turn changes the overall shape of the defect structure, which makes the classification task more difficult.

To address the first problem we consider a data-driven approach to derive noise thresholds. From our study of the physics, we know that the change in position of each atom in two consecutive frames is influenced by the position and
number of its neighbors. To model this behavior we define a random variable D_i:

D_i = (1 / (F + 1)) · (M_i + Σ_{j=1..F} M_j)

where M_i is the displacement of atom i between two consecutive time steps, atom i having F first neighbors within a distance of 2.6 Å (bond length for Si), and i ∈ [1, N], N being the total number of atoms in the lattice. We empirically observe that D_i can be effectively approximated by a normal distribution with parameters µ_noise and σ_noise (the average mean and standard deviation of all D_i). We found µ_noise to be very close to 0 (which is expected because a given atom cannot move very far from its original location between two consecutive time frames). The parameter σ_noise is used to model the effect of noise in the defect classification algorithm. From a set of randomly selected 4500 frames, we found the value of σ_noise to be 0.19 Å.
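The estimation of µ_noise and σ_noise can be sketched as follows (assuming two aligned consecutive frames given as (N, 3) position arrays; the function name and layout are ours):

```python
import numpy as np

def estimate_noise(frame_a, frame_b, cutoff=2.6):
    """Estimate (mu_noise, sigma_noise) from the per-atom variables D_i,
    each averaging an atom's displacement with those of its F first
    neighbors (neighbors taken within `cutoff` angstroms in frame_a)."""
    disp = np.linalg.norm(frame_b - frame_a, axis=1)  # M_i for every atom
    d = np.empty(len(frame_a))
    for i, pos in enumerate(frame_a):
        dist = np.linalg.norm(frame_a - pos, axis=1)
        nbrs = np.where((dist > 0) & (dist <= cutoff))[0]  # F first neighbors
        d[i] = (disp[i] + disp[nbrs].sum()) / (len(nbrs) + 1)
    return d.mean(), d.std()
```

A rigid shift of the whole frame yields a mean displacement equal to the shift and zero spread, as expected.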
Physical Validation: The noise threshold can be validated by using the Mechanical Molecular Model. The bond energy B of a Si-Si bond is 52 Kcal/mol, which is the amount of energy needed to break a Si-Si bond. Using Equation 4.1 we can compute the maximum distance a Si atom can be displaced before the bond is broken. Essentially, we solve the following equation for the value of δ:

B ≤ (1/2) K δ²

By substituting the values of K and B we found the value of δ to be 0.2 Å, which implies that two bonded atoms cannot be moved more than 0.2 Å apart without breaking the bond between them. Thus, the empirically observed value is very close to the theoretical value given by the physical model.
To solve the second problem posed by thermal noise, we propose a weighting mechanism. The weighting mechanism is based on the following two observations:

• Observation 1: In two consecutive time frames the core defect atoms cannot change considerably.

• Observation 2: Interface atoms can make a transition from bulk to defect (and vice versa) very quickly.

Figure 3(a) and Figure 3(b) show the defect detected in two different frames after applying local operators. The defect in Figure 3(b) has extra atoms (interface) but the core defect (black atoms) remains unchanged. Therefore, a weighting mechanism is proposed to reduce the influence of interface atoms relative to that of core atoms within a defect structure. Essentially, the weight assigned to each atom in a given defect is proportional to the number of its first neighbors in the defect structure. Thus core defect atoms contribute more to defect classification than interface atoms. These weights are also used for handling translations
Figure 4: (a) Taxonomy of atoms (b) Energy Plot
(described below) and for computing the feature vector (weighted moments).
Physical Validation: Observation 1 can be explained as follows. Each atom in the lattice interacts most with its first neighbors. The greater the number of first neighbors of an atom, the more connected it is and hence the more restricted its movement. The core defect atoms have high connectivity with other defect atoms, which makes it more difficult for them to move very far in a short period of time. Observation 2 can be explained as follows. Interface atoms labeled as defect (or bulk) usually fail (or conform to) the set of rules by a small margin. A very small variation in their spatial locations can change their labels. These interface atoms, however, are very loosely connected to the core defect atoms. Most of their first neighbors are core bulk atoms. To summarize, over a period of time core defect atoms will change considerably less than interface atoms. Therefore more emphasis (weight) should be given to core defect atoms while matching two defect structures. This is precisely what our weighting mechanism does.
4.1.2 Translational and Rotational Invariance: Translations and rotations pose another problem in defect classification. The same defect can occur in different positions and orientations in the lattice. To classify a defect correctly, translations and rotations should be resolved before assigning the class. We next present our approach to attaining translational and rotational invariance.
Feature Vector Generation: We describe the shape of a defect by using statistical moments. We chose to use all first-, second- and third-order moments. Third-order moments capture skewness in defects. To account for the interface atoms we calculate weighted moments instead of simple moments. (Recall that the weighting mechanism assigns high weights to core defect atoms and low weights to interface atoms.) The feature vector comprising the weighted moments of a defect is calculated as:

W_mnp = (1 / Σ_{j=1..N} w_j) · Σ_{i=1..N} w_i · D_ix^m · D_iy^n · D_iz^p,  where m + n + p ≤ 3

where D_ix is the x-ordinate of the ith atom of defect D. An important property of this feature vector is that it is translation invariant if the weighted center of mass (given by the first three weighted moments) is translated to zero. Since all the rotations in the lattice are symmetry operations (see below for an explanation of symmetry operations), rotational ambiguity can be resolved easily by applying the appropriate permutation to the feature vector. For example, if the defect is rotated by 180 degrees across the X-plane (a mirror transform), all the moments involving an odd power of the X-component will change sign. In a similar fashion, all the rotations can be resolved by checking the pre-defined permutations of the original moments. There are a total of 3 first-order moments, 6 second-order moments and 10 third-order moments. Of these, since the center of mass is translated to the origin (to deal with translations), W_100, W_010 and W_001 are all zero. Therefore we have a 16-dimensional feature vector represented by D_w.
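The 16-dimensional feature vector can be computed as sketched below (assuming per-atom weights from the weighting mechanism of Section 4.1.1 are given; names are ours):

```python
import numpy as np
from itertools import product

def weighted_moments(coords, weights):
    """Return the 16 weighted central moments W_mnp with 2 <= m+n+p <= 3.

    coords: (N, 3) array of atom positions; weights: length-N array.
    The weighted center of mass is moved to the origin first, so the three
    first-order moments vanish and are omitted from the feature vector."""
    coords = np.asarray(coords, float)
    w = np.asarray(weights, float)
    centered = coords - np.average(coords, axis=0, weights=w)
    x, y, z = centered[:, 0], centered[:, 1], centered[:, 2]
    feats = []
    for m, n, p in product(range(4), repeat=3):
        if 2 <= m + n + p <= 3:  # 6 second-order + 10 third-order moments
            feats.append(np.sum(w * x**m * y**n * z**p) / w.sum())
    return np.array(feats)
```

Because of the mean-centering, translating every atom by the same offset leaves the feature vector unchanged.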
Physical Validation: Interface atoms change the shape of the defect, which can change the center of mass considerably.
Therefore, using the center of mass without the weights can assign a new class label to a defect even if the core defect is not new. Thus core defect atoms should contribute more towards the calculation of the center of mass. A lattice cannot be rotated in arbitrary directions. The only rotations possible in the lattice are those which carry the lattice onto itself. This means that after rotation each atom of the lattice is exactly at a position occupied by an atom prior to the rotation. These rotations are known as symmetry operations [9]. Under this constraint, only a finite number of rotations are possible in lattices. For example, in the Si lattice system only 24 types of rotations are possible.
4.1.3 Shape Based Classification: While matching two defect structures, the classifier should take into account the positions of all individual atoms in the defects. This atom-to-atom matching is relatively expensive. Furthermore, because of the large number of defect classes present in simulation datasets, it would be unrealistic to carry out such an atom-to-atom matching for all classes at each and every time step. Therefore a scheme is needed to effectively reduce the number of candidate classes on which an exhaustive atom-to-atom matching is performed. We address this challenge by adopting a two-step classification process. The first step uses weighted moments to find a smaller subset of defect classes to which the unlabeled defect can potentially belong and passes it to the next step. Weighted moments (the feature vector) are used because moments are known to capture the overall shape of an object [13]. The second step then finds the closest class by taking into account the positions of the atoms and their arrangement in three-dimensional space. In essence, both steps use information about the shape of a defect. The first step uses high-level information about the defect structure whereas the next step refines it by matching individual atoms. We achieve the desired efficiency because the first step is computationally very cheap and reduces the search space considerably for the next step. Experimental results to corroborate this are shown in Section 5.
Physical Validation: The majority of the physical properties of a defect are governed by its shape. Most of the stable defects seen so far have a very compact shape. Unstable structures tend to re-organize their atoms to form such a compact structure. The movement of the defect in the lattice is also governed by the shape of the defect.
4.1.4 Emergence of new defect classes: The underlying motivation of our effort is to discover information which can assist scientists to better understand the physics behind defect evolution, ideally in real time. This defect evolution can result in new defect classes which are not in the existing literature. This requires the classification process to be dynamic [6]. The classifier should be dynamic in the classical sense, in that new streaming data elements can be classified, but it should also be dynamic in the sense that new classes (defects), if discovered, can be added to the classifier model in real time. The new defect should be available when the next frame is processed. We next present our two-step classification process, which integrates all the proposed solutions to the above-mentioned challenges.
4.2 Two Step Classification Algorithm: Phase 1 of our framework detects the defect(s) in the lattice as mentioned in Section 3. Phase 2 classifies the defect(s) detected in Phase 1. Given a defect D, the goal is to find the type T of this defect. If D does not match any of the previously seen defects in the simulation, it is labeled as a new defect and stored in the databases ID_shape and ID_moment, where ID is a unique simulation identification number, ID_shape stores the actual three-dimensional co-ordinates, and ID_moment stores the weighted central moments (feature vector) of the defect structure. These databases store all the unique defects detected in the current simulation. The label of a new defect is of the form defect_i_j, indicating that the new defect is the jth defect in the ith frame of the simulation. If D is not new, then a pointer to the defect class which closely matches D is stored. Besides these two databases, a summary file is generated which stores the names of all detected defects in the simulation along with the corresponding frame numbers. We now proceed to describe the two steps of our classifier in detail.
4.2.1 Step 1 - Feature Vector based Pruning: We use a variant of the KNN classifier for this task. The value of K is not fixed: instead, it is determined dynamically for each defect. Given the feature vector (D_W) of a defect D, we compute and sort the distances between D_W and ID_Mi, where ID_Mi is the mean moment vector of the ith defect class in ID_moment. All classes having distances less than an empirically-derived threshold are chosen as candidate classes. Step 2, then, works on these K classes only. If no class can be selected, D is considered as a new defect. The databases ID_shape and ID_moment are updated immediately, so that D is available when the next frame is processed. In a similar fashion, one can use Naive Bayes and voting-based classifiers. Like the KNN classifier, these classifiers also provide metrics which can be used to select the top K candidate classes. More specifically, a Naive Bayes classifier provides the probabilities of a feature vector belonging to each class, and a voting-based classifier gives the number of votes for each class. The top K classes can then be chosen based on probabilities and votes. We chose VFI as our voting-based classifier. As for other types of classifiers, such as the decision tree-based ones, it is not trivial to pick K candidate
classes, therefore they are not considered in this work. From the three applicable classifiers, the KNN classifier is chosen because it gives the highest classification accuracy, as described in Section 5. Besides its high accuracy, the KNN classifier is incremental in nature. In other words, there is no need to re-build the classification model from scratch if a new class is discovered. In contrast, Naive Bayes and VFI would require the classification model to be re-built every time a new class is discovered. The K candidate classes are passed to Step 2. The representative shapes of these K classes are matched using an exact shape matching algorithm based on the Largest Common Substructure (LCS). Next, we explain the main steps of our exact matching approach.
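The dynamic-K pruning can be sketched as follows (a simplified version; the data layout and threshold handling are our assumptions):

```python
import numpy as np

def prune_classes(feature, class_means, threshold):
    """Return indices of candidate classes, sorted by distance, whose mean
    moment vector lies within `threshold` of the defect's feature vector.
    An empty result means the defect is treated as a new class."""
    dists = np.linalg.norm(class_means - feature, axis=1)
    order = np.argsort(dists)
    return [i for i in order if dists[i] < threshold]
```

K here is simply the number of classes that pass the threshold, so it varies from defect to defect.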
4.2.2 Step 2 - Largest Common Substructure based algorithm: Assume A is a defect of unknown type and B is the median defect representing one of the candidate classes from Step 1. The defects are mean centered and the rotation is resolved. We next describe all the steps of the LCS algorithm in detail.

• Atom Pairs Formation: The defects are sorted w.r.t. their x-ordinate. Two atoms i and j in defect A form an atom pair A_ij if distance(A_i, A_j) ≤ bond length. This step uses the information about neighbors and connectivity calculated in Phase 1. These atom pairs are calculated for both defects. For each atom pair A_ij, we store its projections onto the X, Y and Z axes, represented by A_ijx, A_ijy and A_ijz respectively.

• Find Matching Pairs: For each pair A_ij we find all pairs B_kl such that

|A_ijx − B_klx| ≤ σ_noise
|A_ijy − B_kly| ≤ σ_noise
|A_ijz − B_klz| ≤ σ_noise

where the threshold σ_noise is obtained as explained in Section 4.1.1.

We represent this equality of atom pairs as A_ij ↔ B_kl, which implies that the length and orientation of the bond formed by atoms i and j of defect A are similar to those of the bond formed by atoms k and l of defect B. By comparing each projection separately, we intrinsically take care of both bond length and orientation.
• Find Largest Common Substructure (LCS): The rules generated in the previous step are used to find the largest common substructure between the two defects. We use a region growing based approach to find the LCS.
The pseudo code for finding the LCS is shown in Figure 5. Before explaining each step in detail, we define the notion of compatible substructures:
Two substructures U and V are considered compatible w.r.t. the rule Aij ↔ Bkl if the last atom added to U is the ith atom of defect A and the last atom added to V is the kth atom of defect B.
Being compatible implies that the two substructures have the same number of atoms and that the orientation of the atoms (which defines shape) is approximately the same (within noise thresholds).
The algorithm starts by finding all compatible substructures U and V w.r.t. the rule Aij ↔ Bkl (Line 4). The length of U (and V) is increased by 1 and atom j (and l) is added. Lines 5-10 of Figure 5 show this process. However, if no compatible substructures are found, then a new substructure U (and V) is initialized with atoms i and j (k and l). Lines 11-16 in Figure 5 refer to this case. The same process is then repeated for all the rules.
 1  Input: All rules
 2  For each rule: Aij ↔ Bkl
 3  {
 4      Find compatible substructures U and V
 5      If U and V found
 6      {
 7          Length = Length + 1;
 8          U[Length] = j;
 9          V[Length] = l;
10      }
11      else
12      {
13          Create new U and V;
14          Store i and j in U;
15          Store k and l in V;
16      }
17  }

Figure 5: Pseudo code for finding the Largest Common Substructure
This method also provides the correspondence between atoms: atoms in U and V have a one-to-one relationship between them.
• Similarity Metric Computation: The Largest Structure (LS) is then chosen from the common substructures. We use the following metric to determine the similarity between A and B:

  Sim(A, B) = 2 · ‖LS‖ / (‖A‖ + ‖B‖)
This similarity is calculated between A and all the K candidate defect classes. The class which gives the maximum similarity greater than a user defined threshold is chosen as the target class. If the maximum similarity is less than the user defined threshold, the defect is considered new and both databases, IDshape and IDmoment, are updated. The summary database is updated for each defect (previously seen or new).
5 Experiments and Results
In this section we present the results of our framework. Asnoted
earlier we use OHMMS (see Section2) to generate thedatasets. We
first, show the advantage of weighted momentsover unweighted
moments by comparing the accuracies ofvarious classifiers. Next, we
demonstrate the accuracy of theLCS algorithm bootstrapped with
different classifiers:KNN,Naive Bayes and VFI. Later, we show the
scalable aspectsof our framework by deploying it on very large
datasets (inthe giga-byte range). Finally, we present preliminary
resultsdemonstrating how our two-step classifier can help us gain
abetter understanding of defect evolution.
5.1 Robust Classification: To illustrate the importance of using weighted moments as opposed to unweighted moments, we performed the following experiment: a total of 1,400 defects were randomly sampled across multiple simulations conducted at different temperatures. The noise in the simulation depends on the temperature at which the lattice is simulated. Therefore, two defects belonging to the same class can have different numbers of atoms and/or different positions of atoms depending on the temperature, even though their core defect shape remains approximately the same. This sampling strategy ensures that no two defects of the same class are exactly the same. Each defect in this experiment belongs to one of the fourteen classes of single interstitial defects that are known to arise in Si.
For comparison purposes, we tried nine different classifiers. Figure 6 clearly demonstrates that all classifiers perform better when weighted moments are used. Classification accuracies of VFI, KNN (K=1) and decision tree based classifiers are comparable (close to 90%). SMO (an SVM based classifier) also provided good accuracy (85%) but was quite slow; classifying 1,400 defects took over 25 minutes. On average, the classification accuracy increased by 8% when weighted moments were used.
Next, we present the classification accuracies of Naive Bayes, KNN and VFI. These classifiers are modified to pick the K most important classes dynamically (as explained in Section 4). Figure 7 shows the results for this experiment. KNN with weighted moments outperforms all other classifiers by achieving an accuracy of 99%, whereas Naive Bayes is the least accurate with an accuracy of 86%. Again, weighted moments outperform unweighted moments.
Figure 6: Accuracies of various classifiers (Naive Bayes, LWL, Hyperpipe, VFI, Decision tree, OneR, SMO, JRip, KNN with K=1), unweighted vs. weighted moments; y-axis: accuracy (%)
An important point to note is that all the 1,400 defects used for this experiment were labeled manually by a domain expert. However, in actual simulation data there are no predetermined labels, since new classes can be created as the simulation progresses. Also, there is no training data to build the initial model for the Decision tree and Naive Bayes classifiers. For the purpose of this experiment, we artificially divided the dataset into training (90%) and testing (10%) data for all the classifiers that require training data to build a model. Classifier accuracies are averaged over 10 runs of the classifiers.
Only KNN and VFI can discover new classes in real time. Both classifiers calculate a similarity metric for classification: distance in the case of KNN and votes in the case of VFI. If this similarity metric is less than a user defined threshold, a new class label can be assigned to the defect. However, VFI will have to rebuild the whole classification model from scratch whenever a new defect class is discovered. Since a large number of defect classes can be created in a simulation, rebuilding the classification model repeatedly will degrade the performance considerably.
Figure 7: Classification accuracy with dynamic K (KNN, Naive Bayes, VFI), unweighted vs. weighted moments; y-axis: accuracy (%)
Figure 8: Transitions in Three Interstitials: (a) 1st frame, (b) 20,000th frame, (c) 130,000th frame
Thus the LCS algorithm bootstrapped with the KNN classifier using weighted central moments is the best choice in terms of accuracy and efficiency.
5.2 Scalable Classification - Large Simulations: We use three large datasets, namely Two Interstitials, Three Interstitials, and Four Interstitials, for these experiments. Table 1 summarizes the number of frames, the size of each dataset, the total number of defects present in the simulation, and the number of unique defect classes identified by our framework. For all three datasets, our framework was able to correctly identify all the defect structures. However, given the paucity of space, we only present an in-depth analysis of the Three Interstitials dataset. Similar results were also obtained for the other datasets.
Dataset               Number of Frames   Size (in GB)   Total Defects Detected   Unique Defects
Two Interstitials     512,000            4              350,000                  2,841
Three Interstitials   200,200            6              320,000                  1,543
Four Interstitials    297,000            10             410,000                  3,261

Table 1: Datasets Used in Evaluation
This simulation starts with three disconnected interstitial defects. The defects move around in the lattice during the first 19,000 time frames. However, at the 20,000th time frame two of the defects join and form a 'new' larger defect. This larger defect does not change for a long period of time. However, at the 130,000th time frame the third defect joins the 'new' defect and forms a single large defect which remains unchanged until the end of the simulation. Figure 8 shows the evolution of the defects in the simulation. For the rest of this paper we refer to changes in defect shape or type as "transitions".
Though these transitions occur over a large period (thousands of time steps), atoms do not stay at the exact same position in two consecutive frames due to thermal noise. Such thermal agitations can also cause bulk atoms to be labeled as defect atoms (and vice versa). As a result, there exist marginal fluctuations in the shape of a defect from frame to frame. However, the effect of these changes on weighted central moments is relatively small. For example, in the Three Interstitials dataset, the total number of defect instantiations in the simulation was around 320,000, yet our classifier detected only 1,543 unique defect classes. These 1,543 defects capture the actual transitions, as verified by a domain expert. To reiterate, the use of weighted moments minimizes false positives and ensures robust classification.
The use of weighted moments and pruning in Step 1 also allows our approach to achieve good scalability. Finding the LCS is relatively expensive, so we want to invoke it as infrequently as possible. In most cases the number of candidate classes K from Step 1 (the KNN classifier) of our dynamic classifier is less than 3. For example, in the Two Interstitials dataset 2,841 unique defects were found; however, the LCS algorithm only evaluates fewer than 3 closest matches. This underlines the usefulness of the pruning step of our classifier. The discovery of all the unique defect classes demonstrates that the correct defect class is not pruned away. To summarize, pruning based on weighted moments provides scalability to the framework without affecting the accuracy.
Many of these defects are not stable, i.e., they may exist for as few as 100 time frames; however, these unstable defect structures are extremely important since they allow one to understand the physics behind the creation of, and transitioning to, stable structures. We could easily eliminate these unstable structures from our repositories by either maintaining simple counts or by time averaging the frames. However, both of these techniques result in a loss of transition information. To illustrate this point, we took the same Three Interstitials dataset and averaged it over every 128 frames. In this averaged data, we found only 18 unique defects. It turns out that we found all the possible stable structures, but the actual transitioning behavior was lost.
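The time-averaging experiment above can be sketched as follows (an illustrative NumPy implementation of windowed averaging over atom positions; the function name and array layout are our assumptions):

```python
import numpy as np

def time_average(frames, window=128):
    """Average atom positions over non-overlapping windows of `window` frames.
    frames: (T, N, 3) array of N atom positions over T time frames.
    Averaging smooths thermal noise, but it also blurs away defect
    structures that live for fewer than `window` frames."""
    frames = np.asarray(frames, dtype=float)
    t = (len(frames) // window) * window          # drop any trailing partial window
    return frames[:t].reshape(-1, window, *frames.shape[1:]).mean(axis=1)
```

Any defect that appears and disappears inside a single window is averaged into its surroundings, which is exactly why only the 18 stable structures survive while the transitions do not.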
5.2.1 Timing Results: Figure 9 shows the time taken by OHMMS to complete the simulation and the time taken by our framework to analyze the data. The figure also shows the individual time taken by Phase 1 (defect detection) and Phase 2 (classification). Phase 1 takes around 45% of the time and Phase 2 requires the rest. All the experiments were carried out on a Pentium 4 2.8GHz dual processor machine with 1 GB of main memory. Our classifier can analyze the data almost 1.5 times faster than the data generation rate. This allows us to analyze the data and build the defect databases in real time without dropping/losing any frames. Another advantage is that we are not required to store the large simulation file (of the order of 15GB) on disk. All the needed information about defect type(s), number(s) and transitions can be obtained from the simulation databases and the summary file.
Figure 9: Timing results for the Two, Three and Four Interstitial datasets: OHMMS simulation time vs. framework analysis time, with the Phase 1 and Phase 2 breakdown; y-axis: time (in hours)
Next we show how the results produced by the framework can be used for tracking and understanding the movement of defects in the simulation.
5.3 Meta-stable Transitions: The transitions between two meta-stable defect structures are even more important than the creation of transient structures. We next present experimental results on a simulation that depicts the transition of a single defect to another defect.
This is a fairly small but extremely useful simulation. The simulation has 1,300 frames with 67 atoms in each frame and one interstitial defect. We were able to detect the 50 unique defects which actually capture the transitions from a defect of type I3us-01 to I3us-03. (These labels are provided by domain experts.) The defect does not break into multiple parts; therefore, we do not have to deal with the correspondence problem in this case¹. Again, all these results have been verified by our domain expert by manually checking every frame of the simulation.
¹Correspondence allows the labeling of two defects with the same class label at two different time epochs.
5.4 Generating defect trajectories: From the summary database produced at the end of simulation analysis, we can glean important information about the movement of a defect in lattices. The summary database provides the information needed to construct a defect's motion trajectory over a period of time. We use a 10,000 frame simulation to show this. In this simulation the defect moves in the −z direction through the lattice, reaches the end of the lattice and then stays in the xy plane. We found 70 unique defects in this simulation, and all the detected defects are labeled as one of these 70 classes. Most of these defects were highly unstable. We plot the (x, y, z) coordinates of each detected defect's weighted centroid at each time stamp. Figure 10 clearly shows the movement of the defect in the −z direction. This idea can be extended to multiple defect simulations: since the defects in the summary database are labeled, it should be fairly easy to construct multiple trajectories for multiple defects. By studying these labeled trajectories, one can gain more insight into how a defect evolves and interacts with other defects over time.
6 Conclusions
In this application case study, we propose a two-step
clas-sifier to classify the defects in large scale MD
simulationdatasets. The classifier is scalable and incremental in
nature.New classes of defects can be discovered and added to
clas-sifier model in real time. The approach is also robust to
noise(inherent to MD simulations). We present various noise
han-dling schemes and validate these schemes using a physicalmodel
and properties of the lattice systems. We demonstratethe
capabilities of our approach by deploying it on very largedatasets
(≥ 4GB). We were able to find a very small num-ber of unique defect
classes from these large datasets. Theseunique classes capture the
defect transitions very well.We are currently working on solving
the correspondenceproblem in the context of multiple defects. This
will enableus to build an automated system to capture important
eventssuch asdefect disintegrationanddefect amalgamation. An-other
future goal is to understand the interactions among de-fects in a
simulation. Towards this goal, we plan to model themovement of
defect as trajectories, tagged by defect classlabels, and analyze
these trajectories. We also plan to ap-ply other data mining
techniques including frequent itemsetmining and spatial patterns
mining to gain more insight inthe actual defect evolution
process.
References
[1] C. Borgelt and M. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In ICDM, 2002.
Figure 10: Capturing the movement of a defect
[2] S.J. Clark and G.J. Ackland. Ab initio calculation of the self interstitial in silicon. Physical Review Letters, vol. 56, 1997.
[3] A.S. Clarke and H. Jónsson. Structural changes accompanying densification of random hard-sphere packings. Physical Review E, vol. 47, pages 3975–3984, 1993.
[4] D.A. Richie, Jeongnim Kim, and John W. Wilkins. Real-time multiresolution analysis for accelerated molecular dynamics simulations. In American Physical Society March Meeting, 2001.
[5] K. Eric Drexler. Nanosystems: Molecular Machinery, Manufacturing, and Computation. Wiley Publishers, 1992.
[6] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Knowledge and Data Discovery (SIGKDD), 2001.
[7] J.M. Galvez and M. Canton. Normalization and shape recognition of three-dimensional objects by 3D moments. PR, 26:667–681, 1993.
[8] H. Jónsson and H.C. Andersen. Icosahedral ordering in the Lennard-Jones liquid and glass. Physical Review Letters, vol. 60, pages 2295–2298, 1988.
[9] Charles Kittel. Introduction to Solid State Physics. John Wiley and Sons, 1971.
[10] L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In Knowledge Discovery and Data Mining, 1998.
[11] M. Jiang, T.-S. Choy, S. Mehta, M. Coatney, S. Barr, K. Hazzard, D. Richie, S. Parthasarathy, R. Machiraju, D. Thompson, J. Wilkins, and B. Gatlin. Feature mining algorithms for scientific data. In SIAM, 2003.
[12] Sameep Mehta, Kaden Hazzard, Raghu Machiraju, Srinivasan Parthasarathy, and John Wilkins. Detection and visualization of anomalous structures in molecular dynamics simulation data. In IEEE Conference on Visualization, 2004.
[13] M. Hu. Visual pattern recognition by moment invariants. In IRE Trans. Information Theory, pages 179–187.
[14] R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, and S. Barr. Mining of complex evolutionary phenomena, next generation data mining. In NGDM, 2003.
[15] R. Nussinov and H. Wolfson. Efficient detection of three dimensional structural motifs in biological macromolecules by computer vision techniques. In Proceedings of the National Academy of Sciences of the United States of America, volume 88, Dec 1, 1991.
[16] S. Djoko, D. Cook, and L. Holder. Analyzing the benefits of domain knowledge in substructure discovery. In Knowledge Discovery and Data Mining, 1995.
[17] S. Parthasarathy and M. Coatney. Efficient discovery of common substructures in macromolecules. In ICDM, 2002.
[18] R. Veltkamp and M. Hagedoorn. State-of-the-art in shape matching. Technical Report UU-CS-1999-27, Utrecht University, the Netherlands, 1999.
[19] W.K. Leung, R.J. Needs, and G. Rajagopal. Calculation of silicon self interstitial defects. Physical Review Letters, vol. 83, 1999.
[20] X. Wang, J. Wang, D. Shasha, B. Shapiro, S. Dikshitulu, I. Rigoutsos, and K. Zhang. Automated discovery of active motifs in three dimensional molecules. In Knowledge Discovery and Data Mining, 1997.
[21] Y. Lamdan and H. Wolfson. Geometric hashing: a general and efficient model-based recognition scheme. In Proceedings of the second ICCV, pages 238–289, 1988.
[22] C. Zhang and T. Chen. Efficient feature extraction for 2D/3D objects in mesh representation. In ICIP, 2001.