Dr. David Dailey, Dr. Beverly Gocal, Dr. Deborah Whitfield
Dec 27, 2015
Introduction
Graph distance
String distance
◦ Definitions
◦ Examples
◦ Implementation
◦ Theoretical Results
◦ String Space Examples
Distance
◦ may be defined for any structure
Overlap of the substructures of two structures
◦ Strings
◦ Graphs
◦ Algebraic structures
◦ Semi-groups
◦ Trees
Web site and web page similarity
Past 15 years
◦ Over 20 papers on graph similarity
◦ Several more on string similarity
Semi-Group: Let T=(S, A) together with the concatenation operation, where A consists of the set of axioms
◦ ∀ x, y ∈ S, xy ∈ S
◦ ∀ x, y, z ∈ S, x(yz) = (xy)z
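As a small illustration of the two semi-group axioms, the sketch below brute-force checks closure and associativity on a finite carrier set. The function name `is_semigroup` and the sample operations (arithmetic mod 4) are assumptions made for this example and do not come from the slides.

```python
from itertools import product

def is_semigroup(elements, op):
    """Brute-force check of closure and associativity on a finite set."""
    s = set(elements)
    # Closure: for all x, y in S, op(x, y) is in S.
    if not all(op(x, y) in s for x, y in product(s, repeat=2)):
        return False
    # Associativity: for all x, y, z in S, x(yz) = (xy)z.
    return all(op(x, op(y, z)) == op(op(x, y), z)
               for x, y, z in product(s, repeat=3))

# Hypothetical examples: addition mod 4 satisfies both axioms,
# while subtraction mod 4 is closed but not associative.
print(is_semigroup(range(4), lambda x, y: (x + y) % 4))  # True
print(is_semigroup(range(4), lambda x, y: (x - y) % 4))  # False
```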
Graph: Let T=(S, A) together with a relation ~ where A consists of the set of axioms
◦ ∀ x, y ∈ S, x ~ y ⇒ y ~ x
◦ ∀ x ∈ S, ¬(x ~ x)
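The two graph axioms (symmetry and irreflexivity of ~) can be checked the same way on a finite vertex set. This is only a sketch: `is_simple_graph`, the edge set, and the `adjacent` helper are hypothetical names introduced for illustration.

```python
def is_simple_graph(vertices, related):
    """Check that `related` is symmetric and irreflexive on `vertices`."""
    vs = list(vertices)
    # Symmetry: x ~ y implies y ~ x.
    symmetric = all(related(y, x) for x in vs for y in vs if related(x, y))
    # Irreflexivity: no vertex is related to itself.
    irreflexive = not any(related(x, x) for x in vs)
    return symmetric and irreflexive

# Hypothetical edge set for a path a - b - c, stored as unordered pairs.
edges = {frozenset("ab"), frozenset("bc")}

def adjacent(x, y):
    return frozenset((x, y)) in edges

print(is_simple_graph("abc", adjacent))  # True
```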
String: Let T=(S, A) together with an associative operation (expressed by concatenation).
◦ Then let Sⁿ be defined recursively by S¹ = S and Sⁿ = S × Sⁿ⁻¹, and let S* be defined as the infinite union of ordered tuples: S* = S¹ ∪ S² ∪ … ∪ Sⁿ ∪ …
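To make the recursive definition concrete, here is a minimal sketch that enumerates Sⁿ and a finite prefix of S*. Since S* is an infinite union, the code stops at a chosen maximum length; the function names and the two-letter alphabet are assumptions for illustration.

```python
from itertools import product

def tuples_of_length(alphabet, n):
    """S^n: all ordered n-tuples over the alphabet, joined into strings."""
    return ["".join(t) for t in product(sorted(alphabet), repeat=n)]

def strings_up_to(alphabet, max_len):
    """Finite prefix S^1 ∪ S^2 ∪ ... ∪ S^max_len of the infinite union S*."""
    result = []
    for n in range(1, max_len + 1):
        result.extend(tuples_of_length(alphabet, n))
    return result

print(strings_up_to({"a", "b"}, 2))  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```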
Levenshtein distance: the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other
Largest shared substructure
Smallest superstructure
All of these approaches are relative
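For comparison with the substructure-based distance developed in this talk, the following is a standard dynamic-programming sketch of Levenshtein distance. It is not taken from the slides; the function name is an assumption, and the sample strings are the two pairs used in the talk's worked examples.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("abaac", "cbaac"))  # 1
print(levenshtein("aabc", "abcd"))    # 2
```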
Enumerate all substructures within T and U
Union those two sets: Z = T* ∪ U*
Form a |Z|-dimensional vector space
For each z ∈ Z, let z(T) be the number of occurrences of structure z as a substructure of T
Calculate the Minkowski distance d(T, U)
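The steps above can be sketched for strings as follows. Three assumptions are made here that the slides do not state explicitly: substructures are non-empty contiguous substrings, occurrences are counted with multiplicity, and the Minkowski order defaults to p = 1. That order is chosen because it reproduces the value Distance (aabc, abcd) = 8 given in the slides. The function names are hypothetical.

```python
from collections import Counter

def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))

def substructure_distance(t, u, p=1):
    """Minkowski distance of order p between the substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z = T* ∪ U*
    # A Counter returns 0 for a missing key, so absent substrings count as 0.
    total = sum(abs(ct[x] - cu[x]) ** p for x in z)
    return total ** (1 / p)

print(substructure_distance("aabc", "abcd"))  # 8.0
```

Enumerating all substrings takes quadratic space in the string length, which is fine for short examples but would need a more compact index for long inputs.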
Alphabet S = {a, b, c}, α = abaac and β = cbaac
α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac}
β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac}
Z = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac}
(ab, aba, abaa, and abaac are unique to α*; cb, cba, cbaa, and cbaac are unique to β*)
Equal frequency: I = {b, ba, aa, ac, baa, aac, baac}
Different frequency: D = {a, c}
Unique: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac}
|I| = 7, |D| = 2, and |O| = 8
|I| + |D| + |O| = |Z| = 17
Contribution of O is |O| = 8
Contribution of I is 0, since these substrings appear equally often in both strings
Contribution of D in this case is 2: a occurs three times in abaac and twice in cbaac, and c occurs once in abaac and twice in cbaac
d(α, β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10
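The three-way split of Z into I, D, and O can be sketched as below. The function names are hypothetical, and occurrences are assumed to be counted with multiplicity. The contribution of O is computed from occurrence counts, which equals |O| whenever each unique substring occurs exactly once, as in this example.

```python
from collections import Counter

def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))

def partition(alpha, beta):
    """Split Z into I (equal counts), D (shared, different counts),
    and O (present in only one string)."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    i_set, d_set, o_set = set(), set(), set()
    for z in set(ca) | set(cb):
        if ca[z] == 0 or cb[z] == 0:
            o_set.add(z)
        elif ca[z] == cb[z]:
            i_set.add(z)
        else:
            d_set.add(z)
    return i_set, d_set, o_set, ca, cb

def distance_by_parts(alpha, beta):
    """Sum the contributions of I, D, and O (order p = 1 assumed)."""
    i_set, d_set, o_set, ca, cb = partition(alpha, beta)
    contrib_d = sum(abs(ca[z] - cb[z]) for z in d_set)
    contrib_o = sum(ca[z] + cb[z] for z in o_set)
    return contrib_d + contrib_o  # I contributes 0

i_set, d_set, o_set, _, _ = partition("abaac", "cbaac")
print(sorted(i_set), sorted(d_set), sorted(o_set))
print(distance_by_parts("abaac", "cbaac"))
```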
A = aabc, B = abcd
S = {a, a, aa, aab, aabc, ab, abc, b, bc, c}
T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d}
Counts for S and T
◦ S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1
◦ T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1
Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1
Distance (aabc, abcd) = 8
Too tedious by hand
http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html
Conjecture: if |α| = |β| = n and α and β share no substrings in common (i.e., |I ∪ D| = 0), then d(α, β) = n(n+1)
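The conjecture above can be checked empirically for small n. The sketch below assumes substrings are counted with multiplicity and the distance is of order p = 1. It guarantees "no substrings in common" by drawing the two strings from disjoint alphabets. The names `distance` and `check_disjoint_conjecture` are hypothetical, and a passing check on small cases is evidence, not a proof.

```python
from collections import Counter
from itertools import product

def substring_counts(s):
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))

def distance(t, u):
    """Order-1 (Manhattan) distance between substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    return sum(abs(ct[x] - cu[x]) for x in set(ct) | set(cu))

def check_disjoint_conjecture(max_n=4):
    """Test d = n(n+1) for all length-n pairs over disjoint alphabets."""
    for n in range(1, max_n + 1):
        for a in product("ab", repeat=n):
            for b in product("cd", repeat=n):
                if distance("".join(a), "".join(b)) != n * (n + 1):
                    return False
    return True

print(check_disjoint_conjecture())  # True
```

The result is expected under these assumptions: each string of length n has n(n+1)/2 substring occurrences, and with nothing shared every occurrence contributes to the distance.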
Lemma: if α = aⁿ then d() = n² + n(n+1)/2
Conjecture: if |α| = |β| = n, then d() = d() = d() = d() = n² + n(n+1)/2
Pretty pics
Exhaustive substructure vector space
Calculate distance
Interesting observations used to study structure similarity based on size