
DNA Codes and Their Properties

Apr 29, 2023



Lila Kari and Kalpana Mahalingam

University of Western Ontario, Department of Computer Science, London, ON N6A5B7

lila, [email protected]

Abstract. One of the main research topics in DNA computing is associated with the design of information-encoding single or double stranded DNA strands that are “suitable” for computation. Double stranded or partially double stranded DNA occurs as a result of binding between complementary DNA single strands (A is complementary to T and C is complementary to G). This paper continues the study of the algebraic properties of DNA word sets that ensure that certain undesirable bonds do not occur. We formalize and investigate such properties of sets of sequences, e.g., where no complement of a sequence is a prefix or suffix of another sequence or no complement of a concatenation of n sequences is a subword of the concatenation of n + 1 sequences. The sets of code words that satisfy the above properties are called θ-prefix, θ-suffix and θ-intercode respectively, where θ is the formalization of the Watson-Crick complementarity. Lastly we develop certain methods of constructing such sets of DNA words with good properties and compute their informational entropy.

1 Introduction

Several attempts have been made to address the problem of encoding information on DNA, and many authors have proposed various solutions. A common approach has been to use the Hamming distance [2, 7–9, 27]. Experimental separation of strands with “good” sequences that avoid intermolecular cross hybridization was reported in [5, 6]. In [12], Kari et al. introduced a theoretical approach to the problem of designing code words. Theoretical properties of languages that avoid certain undesirable hybridizations were discussed in [14–17, 19, 26]. Based on these ideas and code-theoretic properties, a computer program for generating code words is being developed [13, 21]. Another algorithm for generating such code words, based on backtracking, was developed by Li [23]. In [22] the author used the notion of partial words with holes for the design of DNA strands.

In this paper we continue the study of the algebraic properties of DNA languages suitable for computation. More precisely, every biomolecular protocol involving DNA or RNA generates molecules whose sequences of nucleotides form a language over the four-letter alphabet ∆ = {A, G, C, T}. The Watson-Crick (W/C) complementarity of the nucleotides defines a natural involution mapping θ, A ↦ T and G ↦ C, which is an antimorphism of ∆^*. Undesirable Watson-Crick bonds (undesirable hybridizations) can be avoided if the language satisfies certain coding properties. In this paper we concentrate on θ-prefix, θ-suffix and θ-intercode languages, i.e., languages where no Watson-Crick complement of a word is a prefix or suffix of another word, respectively no Watson-Crick complement of a concatenation of n words is a subword of a concatenation of n + 1 words (see Fig. 1 for the types of hybridizations that are avoided if a word set satisfies these properties). We start the paper with definitions of coding properties that avoid intermolecular cross hybridizations. The notions of θ-prefix and θ-suffix languages have been defined in [17] under the names of θ-p-compliant and θ-s-compliant, respectively. Here we also consider two additional coding properties, namely θ-bifix code and θ-intercode. We make several observations about the closure properties of such languages. In particular, we concentrate on properties of languages that are preserved by union and concatenation.

Fig. 1. Various types of intermolecular hybridization that we want to avoid: (a) a code word is the reverse complement of a subword of a concatenation of two other code words: θ-comma-free codes avoid such hybridizations; (b) the concatenation of m code words is the reverse complement of a subword of a concatenation of m + 1 code words: θ-intercodes (a new notion introduced in this paper) avoid such hybridizations; (c) a code word is the reverse complement of a subword of another code word: θ-infix codes avoid such hybridizations; (d) a code word is the reverse complement of a suffix of another code word: θ-suffix codes avoid such hybridizations; (e) a code word is the reverse complement of a prefix of another code word: θ-prefix codes avoid such hybridizations. (The 3′ end is indicated by an arrow.)

Also, we show that if a set of DNA strands has “good” coding properties that are preserved under concatenation, then the same properties will be preserved under arbitrary ligation of the strands. Section 3 investigates closure properties of various types of involution codes. Algebraic properties of θ-intercodes are discussed in Section 4. In Section 5 we introduce and discuss the properties of sets whose n-element subsets are θ-intercodes and θ-comma-free codes. Section 6 describes several methods to generate involution codes and calculates their informational entropy. Since it turns out that the entropy of these generated involution codes is greater than log 2, it follows by the coding theorem ([1, 24]) that the constructed code words can be used to encode binary strings. We end with a few concluding remarks.

2 Definitions and Properties

An alphabet Σ is a finite non-empty set of symbols. We will denote by ∆ the special case when the alphabet is {A, G, C, T}, representing the DNA nucleotides. A word u over Σ is a finite sequence of symbols in Σ. We denote by Σ^* the set of all words over Σ, including the empty word 1, and by Σ^+ the set of all non-empty words over Σ. We note that with the concatenation operation on words, Σ^* is the free monoid and Σ^+ is the free semigroup generated by Σ. The length of a word u = a_1 · · · a_n is n and is denoted by |u|. For words representing DNA sequences, we will use the following convention. A word u over ∆ denotes a DNA strand in its 5′ → 3′ orientation. The Watson-Crick complement of the word u, also in orientation 5′ → 3′, is denoted by ←u. For example, if u = AGGC then ←u = GCCT.

Throughout the rest of the paper, we concentrate on finite sets X ⊆ Σ^+ that are codes, i.e., every word in X^+ can be written uniquely as a product of words in X. For the background on codes we refer the reader to [4, 28]. We will need the following definitions:

PPref(X) = {u | ∃v ∈ Σ^+, uv ∈ X}
PSuff(X) = {u | ∃v ∈ Σ^+, vu ∈ X}
PSub(X) = {u | ∃v_1, v_2 ∈ Σ^*, v_1v_2 ≠ 1, v_1uv_2 ∈ X}

The sets of all prefixes, suffixes and subwords of a set of words are denoted by Pref, Suff and Sub, respectively. Similarly, we have Suff_k(w) = Suff(w) ∩ Σ^k, Pref_k(w) = Pref(w) ∩ Σ^k, Sub_k(w) = Sub(w) ∩ Σ^k.
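For a finite set of words these sets can be enumerated directly. The following is a minimal Python sketch, not part of the original text: the function names are illustrative assumptions, and the empty word 1 is represented by the empty string.

```python
def proper_prefixes(X):
    """PPref(X): all u such that uv is in X for some non-empty v."""
    return {w[:i] for w in X for i in range(len(w))}


def proper_suffixes(X):
    """PSuff(X): all u such that vu is in X for some non-empty v."""
    return {w[i:] for w in X for i in range(1, len(w) + 1)}


def proper_subwords(X):
    """PSub(X): all u such that v1 u v2 is in X with v1 v2 non-empty."""
    return {w[i:j]
            for w in X
            for i in range(len(w) + 1)
            for j in range(i, len(w) + 1)
            if j - i < len(w)}


def sub_k(w, k):
    """Sub_k(w): the subwords of w that have length exactly k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


if __name__ == "__main__":
    X = {"AGGC"}
    print(sorted(proper_prefixes(X)))
    print(sorted(proper_suffixes(X)))
    print(sorted(sub_k("AGGC", 2)))
```

The word itself is excluded from all three "proper" sets, while the empty word is included, which matches the requirement that the surrounding context be non-empty.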

We follow the definitions initiated in [12] and used in [13]. An involution θ : Σ → Σ of a set Σ is a mapping such that θ^2 equals the identity mapping, θ(θ(x)) = x for all x ∈ Σ. The mapping ν : ∆ → ∆ defined by ν(A) = T, ν(T) = A, ν(C) = G, ν(G) = C is an involution on ∆ and can be extended to a morphic involution of ∆^*. Since the Watson-Crick complementarity appears in a reverse orientation, we consider another involution ρ : ∆^* → ∆^* defined inductively by ρ(s) = s for s ∈ ∆ and ρ(us) = ρ(s)ρ(u) = sρ(u) for all s ∈ ∆ and u ∈ ∆^*. This involution is an antimorphism, i.e., ρ(uv) = ρ(v)ρ(u). The Watson-Crick complementarity is then the antimorphic involution obtained by the composition νρ = ρν. Hence for a DNA strand u we have ρν(u) = νρ(u) = ←u. The involution ρ reverses the order of the letters in a word and as such is used in the rest of the paper.
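The two involutions and their composition can be illustrated with a short Python sketch. The function names nu, rho and theta are labels chosen here for illustration; the second word v is an arbitrary test input.

```python
# Letter involution ν on the DNA alphabet.
NU = {"A": "T", "T": "A", "C": "G", "G": "C"}


def nu(u):
    """Morphic extension of ν: complement each letter, keep the order."""
    return "".join(NU[s] for s in u)


def rho(u):
    """Mirror involution ρ: reverse the order of the letters."""
    return u[::-1]


def theta(u):
    """Watson-Crick involution νρ = ρν (the reverse complement)."""
    return nu(rho(u))


if __name__ == "__main__":
    u, v = "AGGC", "TTAC"
    print(theta(u))                                 # GCCT, as in the example above
    print(nu(rho(u)) == rho(nu(u)))                 # ν and ρ commute
    print(theta(theta(u)) == u)                     # θ is an involution
    print(nu(u + v) == nu(u) + nu(v))               # ν is a morphism
    print(theta(u + v) == theta(v) + theta(u))      # θ is an antimorphism
```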


The following Definition 1 [14–17, 26] introduces notions meant to formalize a variety of language properties, each of which guarantees the absence of a certain unwanted hybridization. The notions of θ-infix and θ-comma-free codes were introduced in [12], where they were called θ-compliant and θ-free, respectively. The notions of θ-intercode and θ-outfix code are new and are introduced here.

Definition 1. Let θ : Σ^* → Σ^* be a morphic or antimorphic involution and X ⊆ Σ^+.

1. The set X is called a θ-infix code if Σ^*θ(X)Σ^+ ∩ X = ∅ and Σ^+θ(X)Σ^* ∩ X = ∅.
2. The set X is called a θ-comma-free code if X^2 ∩ Σ^+θ(X)Σ^+ = ∅.
3. The set X is called a θ-strict code if X ∩ θ(X) = ∅.
4. The set X is called a θ-prefix code if X ∩ θ(X)Σ^+ = ∅.
5. The set X is called a θ-suffix code if X ∩ Σ^+θ(X) = ∅.
6. The set X is called a θ-bifix code if X is both θ-prefix and θ-suffix.
7. The set X is called a θ-intercode if X^{m+1} ∩ Σ^+θ(X^m)Σ^+ = ∅, m ≥ 1. The integer m is called the index of X.
8. The set X is called a θ-outfix code if, for u, θ(u_1)xθ(u_2) ∈ X, θ(u) = θ(u_1)θ(u_2) implies x = 1.
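For finite sets X, properties 1 to 6 of Definition 1 can be tested by brute force. The sketch below assumes that θ is the Watson-Crick involution on ∆; the helper names and the two sample sets are hypothetical and are not taken from any of the cited software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def theta(w):
    """Watson-Crick involution on DNA words (reverse complement)."""
    return w.translate(COMPLEMENT)[::-1]


def images(X):
    return {theta(w) for w in X}


def is_theta_strict(X):
    """X ∩ θ(X) = ∅."""
    return not (set(X) & images(X))


def is_theta_prefix(X):
    """X ∩ θ(X)Σ^+ = ∅: no θ(w) is a proper prefix of a word of X."""
    t_set = images(X)
    return not any(len(x) > len(t) and x.startswith(t) for x in X for t in t_set)


def is_theta_suffix(X):
    """X ∩ Σ^+θ(X) = ∅: no θ(w) is a proper suffix of a word of X."""
    t_set = images(X)
    return not any(len(x) > len(t) and x.endswith(t) for x in X for t in t_set)


def is_theta_bifix(X):
    return is_theta_prefix(X) and is_theta_suffix(X)


def is_theta_infix(X):
    """No θ(w) occurs inside a strictly longer word of X."""
    t_set = images(X)
    return not any(len(x) > len(t) and t in x for x in X for t in t_set)


def is_theta_comma_free(X):
    """X^2 ∩ Σ^+θ(X)Σ^+ = ∅: no θ(w) lies strictly inside a product xy."""
    t_set = images(X)
    return not any(t in (x + y)[1:-1] for x in X for y in X for t in t_set)


if __name__ == "__main__":
    good = {"AACA", "ACAA"}   # uses only A and C, so no image can occur
    bad = {"AC", "GTA"}       # θ(AC) = GT is a proper prefix of GTA
    for name, X in (("good", good), ("bad", bad)):
        print(name, is_theta_strict(X), is_theta_prefix(X), is_theta_suffix(X),
              is_theta_infix(X), is_theta_comma_free(X))
```

The set named good avoids two of the four bases, so every image under θ contains a letter that never occurs in the set; this is the simplest way to satisfy all of the tested properties at once.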

A code X is said to be a θ-strict-prefix (suffix, infix, bifix, comma-free, outfix, intercode) code if X is both θ-prefix (suffix, infix, bifix, comma-free, outfix, intercode, respectively) and θ-strict. Note that θ-infix languages avoid undesirable hybridization of the type depicted in Fig. 1c, θ-comma-free languages avoid the type depicted in Fig. 1a, θ-intercodes avoid the type depicted in Fig. 1b, θ-suffix languages avoid the type depicted in Fig. 1d, θ-prefix languages avoid the type depicted in Fig. 1e, and θ-outfix languages avoid the type depicted in Fig. 2. Note that a θ-intercode of index one is θ-comma-free. Also note that X is a θ-intercode of index m if and only if θ(X) is a θ-intercode of index m. We have defined several properties that are desirable for DNA languages to have. Properties 1 to 4 in Definition 1 have been extensively studied in [12, 14–17, 26]. Here we complete this study by proving the relationships between several properties. The following proposition shows the connection between θ-infix and θ-comma-free languages. Throughout this paper, unless otherwise specified, θ : Σ^* → Σ^* denotes either a morphic or an antimorphic involution.

Fig. 2. Another type of intermolecular hybridization that we want to avoid: the reverse complement of a code word is a concatenation of a prefix and a suffix of another code word. A θ-outfix code (a new notion defined in this paper) avoids such hybridizations.

Proposition 1. Let θ : Σ^* → Σ^* be a morphic or antimorphic involution and X ⊆ Σ^*. Then the following are equivalent.

1. X is a θ-comma-free code.
2. X is θ-infix and θ(X) ∩ PSuff(X)PPref(X) = ∅.
3. X is θ-infix and X^2 ∩ PPref(X)θ(X)PSuff(X) = ∅.
4. X is θ-infix and X^n ∩ (Σ^+θ(X)Σ^+X^{n−2}) = ∅.
5. X is θ-infix and X^n ∩ (X^{n−2}Σ^+θ(X)Σ^+) = ∅.

Proposition 2. Let X ⊆ Σ^+ be a θ-infix code. Then X^3 ∩ Σ^+θ(X^2)Σ^+ = ∅ if and only if θ(X^2) ∩ PSuff(X)XPPref(X) = ∅.

Corollary 1. If X ⊆ Σ^+ is θ-comma-free then X^2 ∩ PSuff(X)θ(X)PPref(X) = ∅ and θ(X^2) ∩ PSuff(X)XPPref(X) = ∅.

3 Closure Properties of Involution Codes

In this section we discuss several properties of θ-prefix, θ-suffix, θ-bifix, θ-outfix and θ-strict codes. Besides being generalizations of outfix codes, θ-outfix codes are motivated by the fact that a set of DNA words that is a θ-outfix code avoids any undesirable hybridization of the type in Fig. 2. Ensuring that no such unwanted hybridization occurs is obviously desirable from an experimental viewpoint. It is interesting to note that certain properties that are not satisfied by θ-prefix and θ-suffix codes are satisfied by θ-bifix codes. In particular, we discuss the conditions under which such languages are closed under arbitrary concatenation. From a practical point of view, these results give conditions under which, given a small finite set of “good” code words, we can construct arbitrarily large sets of good code words by concatenation.

Proposition 3. Let θ be a morphic involution.

1. If X is θ-prefix then X^n is θ-prefix for all n ≥ 1.
2. If X is θ-suffix then X^n is θ-suffix for all n ≥ 1.

Proposition 4. Let θ be a morphic or antimorphic involution on Σ^*. If X is a θ-bifix code then X^n is a θ-bifix code for all n ≥ 1.

The next two propositions give conditions under which, when a concatenation of arbitrary languages satisfies good encoding properties, the right and the left contexts of such languages also satisfy the same good encoding properties.


Proposition 5. Let X ⊆ Σ^+ be such that X is not a θ-strict code.

1. If X^m is θ-prefix for m ≥ 1, then X is θ-prefix.
2. If X^m is θ-suffix for m ≥ 1, then X is θ-suffix.
3. If X^m is θ-bifix for m ≥ 1, then X is θ-bifix.

Proposition 6. Let X_i, i = 1, 2, ..., m, be non-empty languages over Σ such that X_i ∩ θ(X_i) ≠ ∅, i = 1, 2, ..., m. Let θ be a morphic involution. Then the following hold.

1. If X_1X_2...X_m is θ-prefix, then X_2...X_m, X_3...X_m, ..., X_{m−1}X_m, X_m are θ-prefix codes.
2. If X_1X_2...X_m is θ-suffix, then X_1...X_{m−1}, X_1...X_{m−2}, ..., X_1X_2, X_1 are θ-suffix codes.

Proposition 7. For a morphic involution θ, the family of θ-prefix (θ-suffix) codes is closed under concatenation.

Note that the above proposition does not hold when θ is antimorphic. For example, let X_1 = {aa, baa} and X_2 = {bb, bbb} over the alphabet Σ = {a, b}, and let θ be the antimorphism such that a ↦ b and b ↦ a. Both X_1 and X_2 are θ-prefix, but X_1X_2 is not θ-prefix, since aabb ∈ θ(X_1X_2) and aabbb ∈ X_1X_2.
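This counterexample can be checked mechanically. The following Python sketch (helper names are illustrative) tests the θ-prefix property for the antimorphic involution and, for contrast, for the morphic involution with the same letter map.

```python
SWAP = str.maketrans("ab", "ba")


def theta_anti(w):
    """Antimorphic involution with a <-> b: swap the letters, then reverse."""
    return w.translate(SWAP)[::-1]


def theta_morph(w):
    """Morphic involution with a <-> b: swap the letters, keep the order."""
    return w.translate(SWAP)


def prefix_violations(X, theta):
    """Pairs (t, x) with t in θ(X), x in X and t a proper prefix of x."""
    t_set = {theta(w) for w in X}
    return sorted((t, x) for t in t_set for x in X
                  if len(x) > len(t) and x.startswith(t))


def concat(X, Y):
    return {x + y for x in X for y in Y}


if __name__ == "__main__":
    X1, X2 = {"aa", "baa"}, {"bb", "bbb"}
    X1X2 = concat(X1, X2)
    print(prefix_violations(X1, theta_anti))     # []
    print(prefix_violations(X2, theta_anti))     # []
    print(prefix_violations(X1X2, theta_anti))   # [('aabb', 'aabbb')]
    print(prefix_violations(X1X2, theta_morph))  # []
```

An empty list means the set is θ-prefix. With the morphic involution the product stays θ-prefix, in line with Proposition 7.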

Proposition 8. For an antimorphic involution θ, if X_1 and X_2 are such that X_1 ∪ X_2 is θ-bifix, then X_1X_2 and X_2X_1 are θ-bifix.

Proposition 9. If X is a θ-strict-infix code then X^+ is both θ-prefix and θ-suffix.

It is easy to see that every θ-outfix code is θ-prefix and θ-suffix, hence a θ-bifix code. Also note that X is a θ-outfix code if and only if θ(X) is a θ-outfix code.

Proposition 10. For a morphic involution θ, the family of θ-outfix codes is closed under concatenation.

Proposition 11. For a morphic involution θ, let X_1, X_2 ⊆ Σ^+ be such that X_i ∩ θ(X_i) ≠ ∅ for i = 1, 2. If X_1X_2 is a θ-outfix code then both X_1 and X_2 are θ-outfix codes.

Proposition 12. If X is θ-outfix, then X^n is θ-outfix for all n ≥ 1.

Lemma 1.

1. If X_1, X_2 ⊆ Σ^+ are θ-strict, then X_1 ∪ X_2 is not necessarily θ-strict.
2. Let X_1, X_2 be θ-strict. Then X_1 ∩ θ(X_2) = ∅ and X_2 ∩ θ(X_1) = ∅ if and only if X_1 ∪ X_2 is θ-strict.
3. If X_1 and X_2 are θ-strict, then X_1 ∩ X_2 is θ-strict.
4. Let X_1 and X_2 be θ-strict. When θ is a morphism, if one of X_1 or X_2 is θ-prefix, then X_1X_2 is θ-strict. When θ is an antimorphism, if X_1 ∪ X_2 is θ-strict-bifix, then X_1X_2 is θ-strict.
5. If X is θ-strict-bifix, then X^+ is θ-strict.
6. X is θ-strict if and only if θ(X) is θ-strict.


4 Involution Intercodes

We now generalize the concept of θ-comma-free codes to θ-intercodes and study the properties of such codes. Note that if θ is the identity function, a θ-intercode becomes the well-known notion of intercode, widely studied in the literature [28]. Besides being generalizations of intercodes, θ-intercodes are motivated by the fact that a set of DNA words that is a θ-intercode avoids any undesirable hybridization of the type in Fig. 1b. Ensuring that no such unwanted hybridization occurs is obviously desirable from an experimental viewpoint.

Proposition 13. Let X be a regular language. Then for a given m ≥ 1, it is decidable whether or not X is a θ-intercode of index m.

Proposition 14. Let |Σ| ≥ 2. Then for any m ≥ 1, every θ-intercode of index m is a θ-intercode of index m + 1.

Proposition 15. For any involution θ, every θ-intercode X such that X ∩ θ(X) ≠ ∅ is a θ-bifix code.

The converse of the above proposition is not true. For example, let X = {aab, aba} over the alphabet Σ = {a, b}. Let θ be a morphic involution with a ↦ b and b ↦ a. Note that X is both θ-prefix and θ-suffix, but aaθ(aba)a = aababa ∈ X^2. Hence X is not a θ-intercode of index one. It is also shown in [12] that every θ-comma-free code is θ-infix. This is not the case for θ-intercodes of index m ≥ 2. One example is as follows: let X = {b^2ab^3ab^2, a^3} over the alphabet Σ = {a, b} and let θ be an antimorphic involution such that a ↦ b and b ↦ a. The language X is a θ-intercode of index 2 but not a θ-infix code.

Proposition 16. If X is a θ-comma-free code then X is a θ-intercode of index m for all m ≥ 1.

Note that the converse of the above proposition is not true. For example, let X = {cbaa, baad, babb} over the alphabet Σ = {a, b, c, d}. Let θ be an antimorphic involution with a ↦ b and c ↦ d. It is easy to check that X^3 ∩ Σ^+θ(X^2)Σ^+ = ∅, but X is not θ-comma-free since cbθ(babb)ad = cbaabaad ∈ X^2.
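The three examples of this section are small enough to verify by exhaustive search. The following Python sketch is an illustration only; make_theta, power and the two predicates are hypothetical helper names, and the search simply enumerates X^m and X^{m+1}.

```python
from itertools import product


def make_theta(pairs, antimorphic):
    """Build θ from swapped letter pairs such as "ab"; other letters are fixed."""
    table = {}
    for x, y in pairs:
        table[x], table[y] = y, x

    def theta(w):
        image = "".join(table.get(s, s) for s in w)
        return image[::-1] if antimorphic else image

    return theta


def power(X, k):
    """The set X^k of all products of k words of X."""
    return {"".join(t) for t in product(sorted(X), repeat=k)}


def is_theta_intercode(X, theta, m):
    """X^(m+1) ∩ Σ^+ θ(X^m) Σ^+ = ∅."""
    t_set = {theta(w) for w in power(X, m)}
    return not any(t in w[1:-1] for w in power(X, m + 1) for t in t_set)


def is_theta_infix(X, theta):
    t_set = {theta(w) for w in X}
    return not any(len(x) > len(t) and t in x for x in X for t in t_set)


if __name__ == "__main__":
    morph = make_theta(["ab"], antimorphic=False)
    anti = make_theta(["ab"], antimorphic=True)
    anti4 = make_theta(["ab", "cd"], antimorphic=True)

    print(is_theta_intercode({"aab", "aba"}, morph, 1))             # False
    Y = {"bbabbbabb", "aaa"}
    print(is_theta_intercode(Y, anti, 2), is_theta_infix(Y, anti))  # True False
    Z = {"cbaa", "baad", "babb"}
    print(is_theta_intercode(Z, anti4, 2), is_theta_intercode(Z, anti4, 1))  # True False
```

A θ-intercode of index one is exactly a θ-comma-free code, so the last line confirms that the third set has index 2 without being θ-comma-free.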

For any word u = a_1a_2...a_n ∈ Σ^* with a_i ∈ Σ, define the reverse of u as u^R = a_na_{n−1}...a_2a_1. For X ⊆ Σ^+, define X^R = {u^R : u ∈ X}. The following characterization of θ-intercodes of index m is an immediate consequence of the definition of θ-intercodes.

Proposition 17. Let X ⊆ Σ^+. The following are equivalent.

1. X is a θ-intercode of index m.
2. X^R is a θ-intercode of index m.
3. For any u ∈ X^m and x, y ∈ Σ^*, xθ(u)y ∈ X^{m+1} implies x = 1 or y = 1.

Proposition 18. If X is a θ-intercode of index m then X^k ∩ Σ^+θ(X^m)Σ^+ = ∅ for all k ≤ m + 1.


Proposition 19. If X ⊆ Σ^+ is a θ-intercode of index m, m ≥ 1, and X is a θ-strict-infix code, then X^n ∩ Σ^+θ(X^m)Σ^+ = ∅ and X^m ∩ Σθ(X^n)Σ^+ = ∅ for all n ≥ m.

Proposition 20. If X is a θ-intercode of index m and X is strictly θ-infix, then X^n is a θ-intercode of index m for all n ≥ 1.

Proposition 21. If X is a θ-intercode of index m and X is strictly θ-infix, then X^+ is a θ-intercode of index m.

Proposition 22. If X ∪ Y is a θ-intercode of index m then XY is a θ-intercode of index m.

5 n-θ-comma-free codes and n-θ-intercodes

If the alphabet Σ consists of more than one letter, the partial order ≤_c defined on Σ^* by u ≤_c v if and only if v = xu = ux for some x ∈ Σ^* plays an interesting role. That is, if u ≤_c v, then u = f^i for some primitive word f (f is primitive if f = a^i, a ∈ Σ^+, for some i implies i = 1) and v = f^{i+j} for some j ≥ 0. Thus if u, v ∈ X ⊆ Σ^+ and X is an independent set with respect to ≤_c, then uv ≠ vu, which is equivalent to the fact that the two-element set {u, v} is a code. Hence a ≤_c-independent set is called a 2-code. This notion can be generalized as follows: an n-code is a set X with the property that every n-element subset of X is a code ([28]). The notions of n-codes, n-comma-free codes and hence n-intercodes were defined and studied in [28]. Here we extend these concepts to involution comma-free codes and involution intercodes, and investigate the algebraic properties of the resulting codes. An n-θ-intercode of index m is a language X ⊆ Σ^+ such that every subset of X with at most n elements is a θ-intercode of index m. An n-θ-comma-free code is an n-θ-intercode of index one.

Proposition 23. If X is a 2-θ-comma-free code then X is θ-infix.

Proposition 24. Let X ⊆ Σ^+ be such that X ∩ θ(X) = ∅ and θ(PSuff(X)) ∩ PPref(X) = ∅. Then the following are equivalent.

1. X is a 2-θ-comma-free code.
2. X is θ-infix and for u, v ∈ Σ^+, if uv ∈ θ(X) then X ∩ vΣ^*u = ∅.

Proposition 25. X is a 3-θ-comma-free code if and only if X is a θ-comma-free code.

Proposition 26. If X is a k-θ-comma-free code then X is an m-θ-comma-free code for all m ≤ k.

Proposition 27. X is a θ-intercode of index m if and only if X is a (2m + 1)-θ-intercode of index m.


Note that every θ-intercode of index m is an n-θ-intercode of index m for all n ≥ 1. But for n ≤ 2m an n-θ-intercode of index m is not necessarily a θ-intercode of index m.

Proposition 28. If X is a 2-θ-comma-free code, then Xy and yX are 2-θ-comma-free codes for all y ∈ X.

Note that X being a 2-θ-comma-free code does not imply that X^n is a 2-θ-comma-free code. For example, let X = {ebb, dae, aac, beb}. Let θ be a morphic involution such that a ↦ b, c ↦ d and e ↦ e. It is easy to check that X is a 2-θ-comma-free code but X^2 is not, since ebbdae, aacbeb ∈ X^2 with ebbdaeaacbeb = eθ(aacbeb)acbeb.
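The definition of an n-θ-comma-free code quantifies over all subsets with at most n elements, which a short Python sketch can enumerate for this example. The sketch assumes that the fourth word of X is beb, as the factorization ebbdaeaacbeb = eθ(aacbeb)acbeb requires; the function names are illustrative.

```python
from itertools import combinations

# Morphic involution with a <-> b, c <-> d and e fixed.
TABLE = str.maketrans("abcd", "badc")


def theta(w):
    return w.translate(TABLE)


def is_theta_comma_free(S):
    """S^2 ∩ Σ^+θ(S)Σ^+ = ∅ for a finite set S."""
    t_set = {theta(w) for w in S}
    return not any(t in (x + y)[1:-1] for x in S for y in S for t in t_set)


def is_n_theta_comma_free(X, n):
    """Every non-empty subset of X with at most n elements is θ-comma-free."""
    words = sorted(X)
    return all(is_theta_comma_free(set(S))
               for k in range(1, n + 1)
               for S in combinations(words, k))


if __name__ == "__main__":
    X = {"ebb", "dae", "aac", "beb"}
    X2 = {x + y for x in X for y in X}
    print(is_n_theta_comma_free(X, 2))    # True
    print(is_n_theta_comma_free(X2, 2))   # False
    print(is_n_theta_comma_free(X, 3))    # False
```

The last line shows that X is not 3-θ-comma-free; by Proposition 25 it is therefore not θ-comma-free, so the 2-θ-comma-free property is strictly weaker here.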

6 Methods for Constructing Involution Codes

With the constructions in this section we show several ways to generate involution codes with “good” properties. Many authors have realized that in the design of DNA strands it is helpful to consider three out of the four bases. This was the case with several successful experiments [3, 8, 25]. It turns out that this technique, or a variation of it, can be generalized in such a way that codes with some of the desired properties can be easily constructed. Methods to construct θ-infix, θ-comma-free, θ-k-codes and θ-subword-k-codes were provided in [15]. In this section, we concentrate on providing methods to generate θ-prefix, θ-suffix, θ-bifix, θ-outfix codes and θ-intercodes X such that X^+ has the same property. Some of these methods (Proposition 29) are in some sense generalizations of the idea of considering only three out of four bases. For each code X, the entropy of X^+ is computed. The entropy measures the information capacity of the codes, i.e., the efficiency of these codes when used to represent information.

Suppose that we have a source alphabet with p symbols occurring with probabilities s_1, s_2, ..., s_p. If s_1 = 1, then there is no information since we know what the message must be. If the probabilities are different, then a symbol with low probability gives more information than a symbol with high probability. Hence information is inversely related to the probability of occurrence. Entropy is the average information over the whole alphabet of symbols.

The standard definition of entropy of a code X ⊆ Σ^+ uses a probability distribution over the symbols of the alphabet of X (see [4]). However, for a p-symbol alphabet, the maximal entropy is obtained when each symbol appears with the same probability 1/p. In this case the entropy essentially counts the average number of words of a given length as subwords of the code words [20]. From the Coding Theorem ([1]), it follows that {0, 1}^+ can be encoded by X^+ with Σ ↦ {0, 1} if the entropy of X^+ is at least log 2 (see Theorem 5.2.5 in [24]). The codes for θ-comma-free, strictly θ-comma-free, and θ-k-codes designed in this section have entropy larger than log 2 when the alphabet has p = 4 symbols. Hence, such DNA codes can be used for encoding bit-strings.

We start with the entropy definition as defined in [24].


Definition 2. Let X be a code. The entropy of X^+ is defined by

h(X) = lim_{n→∞} (1/n) log |Sub_n(X^+)|.

If G is a deterministic automaton or an automaton with a delay (see [24]) that recognizes X^+ and A_G is the adjacency matrix of G, then by Perron-Frobenius theory A_G has a maximal positive eigenvalue μ and the entropy of X^+ is log μ (see Chapter 4 of [24]). We use this fact in the following computations of the entropies of the designed codes. In [12], Proposition 16, the authors designed a set of DNA code words that is strictly θ-comma-free. The following propositions show that, in a similar way, we can construct codes with additional “good” properties.
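As a small illustration of this eigenvalue computation, the Python sketch below estimates μ by power iteration and compares log μ with the counting formula of Definition 2. The code X = {0, 10} and its two-state automaton are assumptions chosen for illustration and do not appear in the paper; the subwords of X^+ are then exactly the binary words that avoid the factor 11. Power iteration is used here as a stand-in for an eigenvalue solver and assumes a primitive non-negative matrix.

```python
import math


def perron_eigenvalue(A, iterations=200):
    """Estimate the largest eigenvalue of a primitive non-negative matrix."""
    n = len(A)
    v = [1.0] * n
    mu = 0.0
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        mu = max(w)                  # current eigenvalue estimate
        v = [x / mu for x in w]      # renormalize so that max(v) == 1
    return mu


def count_subwords(n):
    """|Sub_n(X^+)| for X = {0, 10}: binary words of length n without "11"."""
    return sum("11" not in format(k, "0{}b".format(n)) for k in range(2 ** n))


if __name__ == "__main__":
    # State 0: last letter was 0 (or start); state 1: last letter was 1.
    A = [[1, 1],
         [1, 0]]
    mu = perron_eigenvalue(A)
    print(mu, math.log(mu))
    for n in (4, 8, 12):
        print(n, math.log(count_subwords(n)) / n)
```

The per-letter growth rates printed in the loop decrease slowly towards log μ, which is why the eigenvalue route is preferred over direct counting.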

In what follows we assume that Σ is a finite alphabet with |Σ| ≥ 3 and θ : Σ → Σ is an involution which is not the identity. We denote by p the number of symbols in Σ.

Proposition 29. Let a ∈ Σ be such that θ(a) ≠ a. Let X = ⋃_{i=1}^{∞} a^n(Σ \ θ(a))^i a^n for a fixed integer n ≥ 1. Then X and X^+ are both θ-prefix and θ-suffix. The entropy of X^+ is such that log(p − 1) < h(X^+) < log(p).

In the case of the DNA alphabet, p = 4 and for n = 1 the above characteristic equation becomes μ^3 − 3μ^2 − 3 = 0. The largest real value of μ is approximately 3.27902, which means that the entropy of X^+ is greater than log 2.
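The quoted root can be reproduced numerically. The following Python sketch uses plain bisection on the interval [3, 4], which brackets the root because the polynomial is negative at 3 and positive at 4; the function name and the tolerance are illustrative choices.

```python
import math


def bisect_root(f, lo, hi, tol=1e-12):
    """Bisection on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def characteristic(mu):
    """Characteristic polynomial quoted for p = 4 and n = 1."""
    return mu ** 3 - 3 * mu ** 2 - 3


if __name__ == "__main__":
    mu = bisect_root(characteristic, 3.0, 4.0)
    print(round(mu, 5))                                # 3.27902
    print(math.log(3) < math.log(mu) < math.log(4))    # bounds of Proposition 29
```

The polynomial has a local maximum of −3 at μ = 0, so this is its only real root, and log μ lies strictly between log(p − 1) and log(p) for p = 4.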

Proposition 30. Let a, b ∈ Σ be such that θ(a) ≠ θ(b) ≠ a ≠ b. Let X = ⋃_{i=1}^{∞} a^nΣ^ib^n for a fixed integer n ≥ 1. Then X and X^+ are θ-bifix and θ-outfix. The entropy of X^+ is such that log(p − 1) < h(X^+) < log(p).

In the case of the DNA alphabet, p = 4 and for n = 1 the above characteristic equation becomes μ^3 − 4μ^2 − 4 = 0. The largest real value of μ is approximately 4.22417, which means that the entropy of X^+ is greater than log 2.

Proposition 31. Choose distinct a, b, c ∈ Σ such that θ(a) ≠ b, c and θ(a) ≠ a. Let X = ⋃_{i=1}^{∞} a^n(Σ^{n−1}c)^ib^n for some n ≥ 2. Then X and so X^+ are strictly θ-intercodes of index m for all m ≥ 1. The entropy of X^+ is such that log(p^{(n−1)/n}) < h(X^+) < log((p^{n−1} + 1)^{1/n}).

For the DNA alphabet, p = 4, and for n = 2, the above characteristic equation becomes μ^6 − 4μ^4 − 4 = 0. Solving for μ, the largest real value of μ is approximately 2.05528. Hence the entropy of X^+ is greater than log 2.
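As a consistency check of these numbers against the bounds of Proposition 31 for p = 4 and n = 2, the substitution x = μ^2 reduces the equation to the cubic x^3 − 4x^2 − 4 = 0, whose real root is approximately 4.22417. The following worked arithmetic is added for illustration:

```latex
\begin{align*}
\mu^6 - 4\mu^4 - 4 = 0
  &\iff x^3 - 4x^2 - 4 = 0, \qquad x = \mu^2, \\
x \approx 4.22417
  &\implies \mu = \sqrt{x} \approx 2.05528, \\
p^{\frac{n-1}{n}} = 4^{\frac{1}{2}} = 2
  &< \mu \approx 2.05528
   < \sqrt{5} \approx 2.23607 = \left(p^{\,n-1} + 1\right)^{\frac{1}{n}} .
\end{align*}
```

Taking logarithms gives log 2 < h(X^+) = log μ < log √5, in agreement with the stated bounds.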

Example 1. Consider ∆ and θ = ρν, and let n = 2, a = A, c = C, b = G. Then X = ⋃_{i=1}^{∞} AA(∆C)^iGG and X^+ are strictly θ-intercodes of index m for all m ≥ 1.
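Since X is infinite, only a finite truncation can be tested directly. The following Python sketch generates the words AA(∆C)^iGG for small i and checks the strict θ-intercode condition by brute force for small indices; the truncation bound and the helper names are assumptions made for illustration, and a passing check on a truncation is evidence rather than a proof.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def theta(w):
    """Watson-Crick involution θ = ρν (reverse complement)."""
    return w.translate(COMPLEMENT)[::-1]


def example_words(max_i):
    """Words AA(∆C)^i GG for 1 <= i <= max_i, a finite truncation of X."""
    words = set()
    for i in range(1, max_i + 1):
        for letters in product("ACGT", repeat=i):
            words.add("AA" + "".join(s + "C" for s in letters) + "GG")
    return words


def power(X, k):
    return {"".join(t) for t in product(sorted(X), repeat=k)}


def is_strict_theta_intercode(X, m):
    """θ-strict and X^(m+1) ∩ Σ^+θ(X^m)Σ^+ = ∅."""
    strict = not (X & {theta(w) for w in X})
    t_set = {theta(w) for w in power(X, m)}
    return strict and not any(t in w[1:-1] for w in power(X, m + 1) for t in t_set)


if __name__ == "__main__":
    X = example_words(2)
    print(len(X))                                          # 4 + 16 = 20 words
    print(is_strict_theta_intercode(X, 1))                 # True
    print(is_strict_theta_intercode(example_words(1), 2))  # True
    print(any("TT" in w for w in X))                       # False
```

The last line hints at why the construction works: every image under θ ends in TT, while no word of X contains TT, because a T can occur only at a ∆ position, which is always followed by C.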


7 Concluding remarks

In this paper we investigated theoretical properties of languages that avoid certain types of undesirable Watson-Crick bindings: θ-outfix codes, θ-intercodes, n-θ-intercodes and n-θ-comma-free codes. These new concepts generalize the classical notions of outfix codes, intercodes, n-intercodes and n-comma-free codes, respectively. In addition, DNA word sets that are θ-outfix codes or θ-intercodes are of interest in the design of DNA computing experiments, since such sets avoid the unwanted hybridizations shown in Fig. 1 and Fig. 2. This paper investigates algebraic properties of such codes. We also developed certain methods to construct such sets of DNA code words with good properties and calculated their informational entropy.

Acknowledgment This work has been supported by NSERC and a Canada Research Chair Grant for Lila Kari.

References

1. R.L. Adler, D. Coppersmith and M. Hassner, Algorithms for sliding block codes - an application of symbolic dynamics to information theory, IEEE Trans. Inform. Theory 29 (1983): 5-22.

2. E.B. Baum, DNA sequences useful for computation, unpublished article (1996).

3. R.S. Braich, N. Chelyapov, C. Johnson, P.W.K. Rothemund, L. Adleman, Solution of a 20-variable 3-SAT problem on a DNA computer, Science 296(5567) (2002) 499-502.

4. J. Berstel, D. Perrin, Theory of Codes, Academic Press, Inc., Orlando, Florida, 1985.

5. R. Deaton, J. Chen, H. Bi, M. Garzon, H. Rubin, D.F. Wood, A PCR based protocol for in vitro selection of non-crosshybridizing oligonucleotides, DNA Computing: Proceedings of the 8th International Meeting on DNA Based Computers (M. Hagiya, A. Ohuchi editors), Springer LNCS 2568 (2003) 196-204.

6. R. Deaton, J. Chen, M. Garzon, J. Kim, D. Wood, H. Bi, D. Carpenter, Y. Wang, Characterization of non-crosshybridizing DNA oligonucleotides manufactured in vitro, DNA Computing: Preliminary Proceedings of the 10th International Meeting on DNA Based Computers (C. Ferretti, G. Mauri, C. Zandron editors), June 7-10 (2004) 132-141.

7. R. Deaton et al., A DNA based implementation of an evolutionary search for good encodings for DNA computation, Proc. IEEE Conference on Evolutionary Computation ICEC-97 (1997) 267-271.

8. D. Faulhammer, A.R. Cukras, R.J. Lipton, L.F. Landweber, Molecular computation: RNA solutions to chess problems, Proceedings of the National Academy of Sciences, USA 97(4) (2000) 1385-1389.

9. M. Garzon, R. Deaton, D. Renault, Virtual test tubes: a new methodology for computing, Proc. 7th Int. Symposium on String Processing and Information Retrieval, A Coruna, Spain, IEEE Computing Society Press (2000) 116-121.

10. T. Head, Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors, Bull. Math. Biology 49 (1987) 737-759.


11. T. Head, Gh. Paun, D. Pixton, Language theory and molecular genetics, in Handbook of Formal Languages, Vol. II (G. Rozenberg, A. Salomaa editors), Springer Verlag (1997) 295-358.

12. S. Hussini, L. Kari, S. Konstantinidis, Coding properties of DNA languages, DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers (N. Jonoska, N.C. Seeman editors), Springer LNCS 2340 (2002) 57-69.

13. N. Jonoska, D. Kephart, K. Mahalingam, Generating DNA code words, Congressus Numerantium 156 (2002): 99-110.

14. N. Jonoska and K. Mahalingam, Languages of DNA based code words, Proceedings of the 9th International Meeting on DNA Based Computers (J. Chen, J. Reif editors), Springer LNCS 2943 (2004): 61-73.

15. N. Jonoska and K. Mahalingam, Methods for constructing coded DNA languages, Aspects of Molecular Computing (N. Jonoska, G. Paun, G. Rozenberg editors), Springer LNCS 2950 (2004): 241-253.

16. N. Jonoska, K. Mahalingam and J. Chen, Involution codes: with application to DNA coded languages, Natural Computing 4(2) (2005) 141-162.

17. L. Kari, S. Konstantinidis, E. Losseva and G. Wozniak, Sticky-free and overhang-free DNA languages, Acta Informatica 40 (2003): 119-157.

18. L. Kari, S. Konstantinidis and P. Sosik, Bond-free languages: formalizations, maximality and construction methods, Preliminary Proceedings of DNA 10, June 7-10 (2004): 16-25.

19. L. Kari, S. Konstantinidis, P. Sosik, Preventing undesirable bonds between DNA codewords, Preliminary Proceedings of DNA 10, June 7-10 (2004) 375-384.

20. M.S. Keane, Ergodic theory and subshifts of finite type, in Ergodic Theory, Symbolic Dynamics and Hyperbolic Spaces (ed. T. Bedford et al.), Oxford Univ. Press, Oxford (1991): 35-70.

21. D. Kephart and J. Lefevre, Codegen: the generation and testing of DNA code words, Proceedings of IEEE Congress on Evolutionary Computation, June (2004): 1865-1873.

22. P. Leupold, Partial words for DNA coding, Preliminary Proceedings of DNA 10, June 7-10 (2004) 26-35.

23. Z. Li, Construct DNA code words using backtrack algorithm, preprint.

24. D. Lind and B. Marcus, An Introduction to Symbolic Dynamics and Coding, Cambridge University Press, Inc., Cambridge, United Kingdom (1999).

25. Q. Liu et al., DNA computing on surfaces, Nature 403 (2000) 175-179.

26. K. Mahalingam, Involution codes: with application to DNA strand design, Doctoral Dissertation in Mathematics, University of South Florida, Tampa, FL.

27. A. Marathe, A.E. Condon, R.M. Corn, On combinatorial word design, Preproceedings of the 5th International Meeting on DNA Based Computers, Boston (1999) 75-88.

28. H.J. Shyr, Free Monoids and Languages, Hon Min Book Company, 2001.
