DNA CRYPTOGRAPHY

NIT KURUKSHETRA 1

Chapter 1 Introduction

1 Introduction

1.1 DNA Cryptography

DNA cryptography is a young cryptographic field that emerged from research on DNA computing. In it, DNA serves as the information carrier and modern biological technology serves as the implementation tool. The vast parallelism and extraordinary information density inherent in DNA molecules are exploited for cryptographic purposes such as encryption, authentication and signature.

1.2 DNA

DNA is the abbreviation for deoxyribonucleic acid, the hereditary material of living organisms. DNA is a biological macromolecule made of nucleotides. Each nucleotide contains a single base, and there are four kinds of bases: adenine (A), thymine (T), cytosine (C) and guanine (G), corresponding to four kinds of nucleotides. A single DNA strand has an orientation: one end is called 5′ and the other end is called 3′. In nature, DNA usually exists as a double-stranded molecule. The two complementary strands are held together in a double-helix structure by hydrogen bonds between the complementary bases, A with T and C with G.

Fig 1.2.1 Double helix structure of DNA

1.3 Amino Acid Codes

Amino Acid Name | Amino Acid Code | Nucleotide Codon
Alanine | A | GCT GCC GCA GCG
Arginine | R | CGT CGC CGA CGG AGA AGG
Asparagine | N | AAT AAC
Aspartic acid (Aspartate) | D | GAT GAC
Cysteine | C | TGT TGC
Glutamine | Q | CAA CAG
Glutamic acid (Glutamate) | E | GAA GAG
Glycine | G | GGT GGC GGA GGG
Histidine | H | CAT CAC
Isoleucine | I | ATT ATC ATA
Leucine | L | TTA TTG CTT CTC CTA CTG
Lysine | K | AAA AAG
Methionine | M | ATG
Phenylalanine | F | TTT TTC
Proline | P | CCT CCC CCA CCG
Serine | S | TCT TCC TCA TCG AGT AGC
Threonine | T | ACT ACC ACA ACG
Tryptophan | W | TGG
Tyrosine | Y | TAT TAC
Valine | V | GTT GTC GTA GTG
Asparagine or Aspartic acid (Aspartate) | B | Random codon from D and N
Glutamine or Glutamic acid (Glutamate) | Z | Random codon from E and Q
Unknown amino acid (any amino acid) | X | Random codon
Translation stop | * | TAA TAG TGA
Gap of indeterminate length | - | ---
Unknown character (any character or symbol not in table) | ? | ???

Table 1.3.1 Amino acids and codes

1.4 Primer

A primer is a short synthetic oligonucleotide which is used in many molecular techniques

from PCR to DNA sequencing. These primers are designed to have a sequence which is the

reverse complement of a region of template or target DNA to which we wish the primer to

anneal.

Some thoughts on designing primers:

1. primers should be 17-28 bases in length;
2. base composition should be 50-60% (G+C);
3. primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming;
4. melting temperatures (Tm) between 55 and 80 °C are preferred;
5. the 3'-ends of primers should not be complementary (i.e. should not base pair with each other), as otherwise primer dimers will be synthesised preferentially to any other product;
6. primer self-complementarity (the ability to form secondary structures such as hairpins) should be avoided;
7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G- or C-rich sequences (because of stability of annealing), and should be avoided.
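Some of these guidelines can be checked mechanically. The following is a minimal sketch, not a primer-design tool: the function names and the example target region are hypothetical, only guidelines 1, 2, 3 and 7 are tested, and melting temperature and dimer or hairpin formation are not modelled.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def check_primer(primer: str) -> dict:
    """Apply guidelines 1, 2, 3 and 7 from the list above to one primer."""
    p = primer.upper()
    return {
        "length_17_28": 17 <= len(p) <= 28,
        "gc_50_60_percent": 0.50 <= gc_fraction(p) <= 0.60,
        "ends_in_g_or_c": p[-1] in "GC",
        # A run of three or more G/C bases at the 3' end is flagged.
        "no_gc_run_at_3prime": re.search(r"[GC]{3,}$", p) is None,
    }


# Hypothetical target region; the primer is its reverse complement.
target_region = "ATGACCTGATCGATTGCAGC"
primer = reverse_complement(target_region)
print(primer, check_primer(primer))
```

A primer that fails any check would simply be redesigned, for example by shifting the target region by a few bases.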

1.5 Transcription and Translation

Transcription, or RNA synthesis, is the process of creating an equivalent RNA copy of a

sequence of DNA. Both RNA and DNA are nucleic acids, which use base pairs of nucleotides as

a complementary language that can be converted back and forth from DNA to RNA in the

presence of the correct enzymes. During transcription, a DNA sequence is read by RNA

polymerase, which produces a complementary, anti-parallel RNA strand. As opposed to DNA

replication, transcription results in an RNA complement that includes uracil (U) in all instances

where thymine (T) would have occurred in a DNA complement.

Translation is the first stage of protein biosynthesis (part of the overall process of gene

expression). Translation is the production of proteins by decoding mRNA produced

in transcription. Translation occurs in the cytoplasm where the ribosomes are located.

Ribosomes are made of a small and a large subunit, which surround the mRNA. In

translation, messenger RNA (mRNA) is decoded to produce a specific polypeptide according to

the rules specified by the genetic code. This uses an mRNA sequence as a template to guide the

synthesis of a chain of amino acids that form a protein. Many types of transcribed RNA, such as

transfer RNA, ribosomal RNA, and small nuclear RNA are not necessarily translated into an

amino acid sequence.
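As an illustration of the two steps, the sketch below transcribes a DNA template strand into mRNA and then translates the mRNA into one-letter amino acid codes using the standard genetic code. The function names are assumptions made for this example, and the model ignores promoters, start-codon search and RNA processing.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base slowest, third base fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(template: str) -> str:
    """mRNA (5'->3') complementary and anti-parallel to a DNA template (5'->3')."""
    pairing = str.maketrans("ACGT", "UGCA")
    return template.upper().translate(pairing)[::-1]


def translate(mrna: str) -> str:
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TTAAGCCAT")
print(mrna, translate(mrna))
```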

1.6 Cryptography

Data security and cryptography are critical aspects of conventional computing and may also be

important to possible DNA database applications. Here we provide basic terminology used in

cryptography. The goal is to transmit a message between a sender and receiver such that an

eavesdropper is unable to understand it. Plaintext refers to a sequence of characters drawn from

a finite alphabet, such as that of a natural language. Encryption is the process of scrambling the

plaintext using a known algorithm and a secret key. The output is a sequence of characters

known as the ciphertext. Decryption is the reverse process, which transforms the encrypted

message back to the original form using a key. The goal of encryption is to prevent decryption

by an adversary who does not know the secret key. An unbreakable cryptosystem is one for

which successful cryptanalysis is not possible. Such a system is the one-time-pad cipher. It gets

its name from the fact that the sender and receiver each possess identical notepads filled with

random data. Each piece of data is used once to encrypt a message by the sender and to decrypt

it by the receiver, after which it is destroyed.
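A one-time pad is often illustrated with a bytewise XOR of message and pad. The sketch below uses that common formulation, which is an assumption here since the text does not specify how pad and message are combined; the function names are illustrative.

```python
import secrets


def otp_encrypt(plaintext: bytes, pad: bytes) -> bytes:
    """XOR every message byte with the corresponding pad byte."""
    if len(pad) < len(plaintext):
        raise ValueError("the pad must be at least as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, pad))


# XOR is its own inverse, so decryption is the same operation.
otp_decrypt = otp_encrypt

message = b"GENECRYPTOGRAPHY"
pad = secrets.token_bytes(len(message))   # shared in advance, used only once
ciphertext = otp_encrypt(message, pad)
print(otp_decrypt(ciphertext, pad))
```

The security of the scheme rests entirely on the pad being truly random, as long as the message, and never reused.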

* The main goal of research on DNA cryptography is to explore the characteristics of DNA molecules and reactions, establish the corresponding theories, discover possible development directions, search for simple methods of realizing DNA cryptography, and lay the basis for future development.

1.7 Advantages Of DNA Cryptography

The difficult biological problem referred to here is that it is extremely difficult to amplify the message-encoded sequence without knowing the correct pair of PCR primers. Polymerase Chain Reaction (PCR) is a fast DNA amplification technology based on Watson-Crick complementarity, and is one of the most important inventions in modern biology. Two complementary oligonucleotide primers are annealed to the double-stranded target DNA, and the target DNA is then amplified through a series of polymerase-driven cycles. PCR is a very sensitive method: in theory, a single target DNA molecule can be amplified to about 10^6 copies after 20 cycles (2^20 ≈ 10^6). Thus one can effectively amplify a large number of DNA strands within a very short time.

Considering the high stability of PCR, a primer length of 20 to 27 nucleotides is a comparatively good choice. In this study, each PCR primer is a 20-mer. Having the correct primer pair is essential for PCR amplification: it is extremely difficult to amplify the message-encoded sequence without knowing both primers. An adversary who does not know the correct primer pair and wants to pick out the message-encoded sequence by PCR amplification must choose two primer sequences from about 10^23 possibilities (the number of ways of taking 2 sequences from 4^20 candidates). We therefore believe that this biological problem is difficult and will remain so for a relatively long time.
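The orders of magnitude quoted in this section can be checked with a few lines of arithmetic. The sketch assumes ideal PCR, in which every cycle exactly doubles the target, and counts unordered pairs of distinct 20-mers; the names are illustrative.

```python
import math


def copies_after(cycles: int, start: int = 1) -> int:
    """Ideal PCR: every cycle doubles the number of target molecules."""
    return start * 2 ** cycles


PRIMER_LENGTH = 20
candidates = 4 ** PRIMER_LENGTH            # number of possible 20-mer sequences
primer_pairs = math.comb(candidates, 2)    # unordered pairs of distinct primers

print(f"copies after 20 cycles: {copies_after(20):.2e}")
print(f"20-mer candidates:      {candidates:.2e}")
print(f"possible primer pairs:  {primer_pairs:.2e}")
```

The results are roughly 10^6 copies, 10^12 candidate primers and 6 × 10^23 primer pairs, consistent in order of magnitude with the figures in the text.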

1.8 Limitations Of DNA Cryptography

(i) Lack of the related theoretical basis.

(ii) Difficult to realize and expensive to apply.

1.9 Comparisons among DNA cryptography, traditional cryptography and quantum cryptography

1.9.1 Development

Traditional cryptography can be traced back to the Caesar cipher 2000 years ago or even earlier. Its theory is largely mature, and all ciphers in practical use can be regarded as traditional ones. Quantum cryptography came into being in the 1970s; its theoretical basis has been laid, but implementation is difficult, and by and large it has not yet been put into practical use. DNA cryptography has a history of only about ten years; its theoretical basis is still under research and its application is very costly.

1.9.2 Security

Except for the one-time pad, traditional cryptographic schemes achieve only computational security; that is, an adversary with unlimited computational power can, in theory, break them. Quantum computers have been shown to have striking computational potential. Although their actual computational power is still uncertain, it is possible that all traditional schemes except the one-time pad could be broken by future quantum computers. Quantum cryptographic schemes, in contrast, are unbreakable under current theories, because their security is based on Heisenberg's Uncertainty Principle. Even if an eavesdropper can do whatever he wants and has infinite computing resources, or even if P = NP, it is still impossible to break such a scheme. Any act of eavesdropping changes the cipher and can therefore be detected. An adversary cannot obtain quanta identical to the intercepted ones, so any attempt to tamper without being detected is in vain. Quantum key agreement schemes therefore have unconditional security. For DNA cryptography, the main security basis is the limitation of biological techniques, which has nothing to do with computing power and makes DNA cryptographic schemes immune to attacks using quantum computers. Nonetheless, the extent of this kind of security, and how long it can be maintained, are still under exploration.

1.9.3 Application

Traditional cryptosystems are the most convenient: the computation can be executed by electronic, quantum or DNA computers; the data can be transmitted by wire, fiber, wireless channel or even by a messenger; and the storage can be CDs, magnetic media, DNA or other storage media. Traditional cryptography can realize purposes such as public-key and private-key encryption, identity authentication and digital signature. Quantum cryptosystems are implemented on quantum channels, and their main advantage lies in real-time communication. Their disadvantage lies in secure data storage, which makes it infeasible to implement public-key encryption and digital signature as easily as traditional cryptosystems do. At the current level of technique, the ciphertext of DNA cryptography can only be transmitted physically. Owing to the vast parallelism, exceptional energy efficiency and extraordinary information density inherent in DNA molecules, DNA cryptography can have special advantages for some cryptographic purposes, such as secure data storage, authentication, digital signature and steganography. DNA can even be used to produce unforgeable contracts, cash tickets and identification cards.

Research on all three kinds of cryptography is still in progress, and a great many problems remain to be solved, especially for DNA and quantum cryptography, which makes it hard to predict the future. From the above discussion, however, we think it likely that the three will coexist, develop together and complement each other, rather than any one of them falling completely into disuse.

1.10 Development directions of DNA cryptography

Since DNA cryptography is still in its immature stage, it is too early to predict the future

development precisely. However, in view of the development of biological techniques and the

requirement of cryptography, we hold the following opinions:

1) DNA cryptography should be implemented using modern biological techniques as tools and hard biological problems as the main security basis, so as to fully exploit its special advantages.

Encryption and decryption are data transformations which, when described mathematically, are easier to implement than physical or chemical ones in the present era of electronic computers and the Internet. If other kinds of cryptosystems are to be researched and developed, they should offer properties, such as higher security levels and storage density, that cannot be realized by mathematical methods on electronic computers. Thus, if DNA cryptography is to be developed, the advantages inherent in DNA should be fully explored: developing nanoscale storage based on the tiny volume of DNA, realizing fast encryption and decryption based on its vast parallelism, and using difficult biological problems, which can be exploited although they are still far from fully understood, as the security foundation of novel cryptosystems that can resist attacks from quantum computers. Since it is not yet certain whether quantum computers threaten the hardness of the various hard mathematical problems, these problems cannot be absolutely excluded as a security basis. Encryption and decryption algorithms that are hard to implement on electronic computers may be feasible on DNA computers, given their vast parallel computational ability. If these schemes withstand attacks by quantum computers, their computational security will be inherited by the DNA schemes. DNA cryptography therefore does not necessarily exclude traditional cryptography, and it is possible to construct a hybrid cryptosystem from the two.

2) Security requirements: Regardless of the many differences between DNA and traditional cryptography, both satisfy the same basic characteristics of cryptography. The communication model for DNA encryption also consists of two parties, a sender and a receiver, who obtain the secret key in a secure or authenticated way and then communicate securely with each other over an insecure or unauthenticated channel. The security requirements should also be founded on the assumption proposed by Kerckhoffs that security should depend only on the secrecy of the decryption key; that is, an attacker should be assumed to know all the details of encryption and decryption except the decryption key. Only under this assumption can a cryptosystem be called secure when no attacker can break it. More precisely, it must be assumed that an attacker knows the basic biological method the designer used, and has enough knowledge and sufficiently good laboratory equipment to repeat the designer's operations. The only thing the attacker does not know is the key. In a DNA cryptosystem, a key is usually some biological material or a preparation procedure, and sometimes the experimental conditions.

3) For DNA cryptography, the current research target should lie first in security and feasibility,

second in storage density.

A sound cryptosystem should be secure as well as easy to implement. The development of modern biological technology makes it possible to express data in DNA, although the related research is only in its initial stage. In fact, it is still difficult to manipulate nanoscale DNA directly. Scientists can easily manipulate DNA with the aid of various restriction enzymes only after the DNA strands have been amplified with a technology such as PCR. With current technology, it is also not yet possible to store all the world's data in several grams of DNA. If the only requirement is to improve storage density, DNA cryptography is hard to implement at the present level of technique.

It is more practical for cryptographers to make use of the collective properties of large numbers of DNA molecules. For example, data can be stored on DNA chips and read by hybridization, which makes input and output faster and more convenient. This method is easier to implement than encoding a message directly into nucleotides, although the storage density is somewhat lower.

4) Currently, the main task for DNA cryptographers is to establish the theory foundations and to

accumulate the practical experience.

It can be shown that DNA offers vast parallelism, exceptional energy efficiency and extraordinary information density, which motivates research on DNA computing and cryptography. The current goal, and difficulty, is to find and exploit this potential to the utmost, but the related research is in its initial stage. Sound theories have not yet been established for either DNA computing or DNA cryptography. Modern biology lays particular stress on experiments rather than theories. There is no efficient way to measure the hardness of a biological problem or the security level of cryptosystems based on it. It is certainly urgent to find such a measure, analogous to computational complexity. At present, the most important tasks are to find sound properties of DNA that can be used for computation and encryption, to establish the theoretical basis and to accumulate experience, on which the design of secure and practical DNA cryptosystems can be based.

1.11 DNA Digital Coding Technology

In information science, the most fundamental coding method is binary digital coding, in which anything can be encoded with the two states 0 and 1 and combinations of them. A DNA sequence contains four kinds of bases: adenine (A), thymine (T), cytosine (C) and guanine (G). The simplest way to encode the 4 nucleotide bases (A, T, C, G) is by means of 4 digits: 0(00), 1(01), 2(10), 3(11). Obviously, there are 4! = 24 possible coding patterns in this encoding format. In a double-helix DNA molecule, the two strands are held together complementarily in terms of sequence, that is, A pairs with T and C pairs with G according to the Watson-Crick complementarity rule. DNA digital coding should reflect this biological characteristic of the 4 nucleotide bases, so the complementary rule (~0) = 1 and (~1) = 0 is adopted: 0(00) is complementary to 3(11), and 1(01) is complementary to 2(10). Among the 24 patterns, only 8 (0123/CTAG, 0123/CATG, 0123/GTAC, 0123/GATC, 0123/TCGA, 0123/TGCA, 0123/ACGT, 0123/AGCT), which are topologically identical, fit the complementary rule of the nucleotide bases. It is suggested that the coding pattern that follows the order of molecular weight, 0123/CTAG, is the best coding pattern for the nucleotide bases. This pattern reflects the biological characteristics of the 4 nucleotide bases well and has a certain biological significance. Binary digital coding of DNA sequences has the following advantages over character-based DNA coding:

(1) It decreases the redundancy of the information coding and improves the coding efficiency compared with traditional character DNA coding.

(2) The digital coding of DNA sequences is very convenient for mathematical and logical operations and may have a great impact on the DNA bio-computer.

(3) After preprocessing with DNA digital coding, a DNA sequence can be used in digital computing and fits the existing computer-processing mode, which facilitates direct conversion between biological information and encrypted information in the cryptography scheme.

(4) With DNA digital coding, a traditional encryption method such as DES or RSA can be used to preprocess the plaintext in the cryptography scheme.
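The counting argument and the 0123/CTAG pattern can be reproduced with a short sketch. It enumerates all 24 assignments of bases to the digits 0 to 3, keeps those in which the bitwise complement of a digit maps to the Watson-Crick partner of its base, and then converts bit strings to and from DNA. The function names are assumptions made for this example.

```python
from itertools import permutations

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_patterns() -> list[str]:
    """Patterns 0123/XXXX in which NOT(digit) maps to the complementary base."""
    found = []
    for perm in permutations("ACGT"):
        # perm[d] is the base for digit d; on two bits, NOT d equals 3 - d.
        if all(COMPLEMENT[perm[d]] == perm[3 - d] for d in range(4)):
            found.append("".join(perm))
    return found


PATTERN = "CTAG"  # 0123/CTAG: C=00, T=01, A=10, G=11


def bits_to_dna(bits: str) -> str:
    """Encode a bit string of even length, two bits per base."""
    if len(bits) % 2:
        raise ValueError("the bit string must have even length")
    return "".join(PATTERN[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Decode a DNA string back into bits."""
    return "".join(format(PATTERN.index(base), "02b") for base in dna)


print(complementary_patterns())
print(bits_to_dna("01000111"), dna_to_bits("TCTG"))
```

With this pattern, complementing a DNA strand base by base corresponds to inverting every bit of the encoded data.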

1.12 System Design Of Encryption Scheme

We now describe the system design of the encryption scheme, whose security is based mainly on difficult biological problems and difficult mathematical problems. We show a way of exchanging a message safely between two specific persons. We call the sender Alice and the intended receiver Bob. First, we define the encryption scheme as follows. Suppose there is a sender Alice who owns an encryption key K_A, and an intended receiver Bob who owns a decryption key K_B (K_A = K_B or K_A ≠ K_B). Alice uses K_A to translate a plaintext M into a ciphertext C by a translation E. Bob uses K_B to translate the ciphertext C back into the plaintext M by a translation D.

The encryption process is:

C = E_{K_A}(M)

The decryption process is:

D_{K_B}(C) = D_{K_B}(E_{K_A}(M)) = M

It is difficult to obtain M from C unless one has K_B. We call the translation E the encryption process and C the ciphertext. Here, K_A, K_B and C are not limited to digital data, but can be any method, material or data, such as a DNA sequence. E and D are likewise not limited to mathematical calculations, but can be any physical, chemical, biological or mathematical process, such as a traditional encryption method. In this paper, an encryption scheme using DNA technologies is proposed in which the traditional cryptosystem RSA is used to preprocess the plaintext. The intended receiver Bob has a pair of keys (e, d), where e is public and d is secret. The general process of the encryption scheme is as follows.

A. Key Generation

The message sender Alice designs a 20-mer oligonucleotide as the forward primer for PCR amplification and transmits it to the intended receiver Bob over a secure channel. Bob likewise designs a 20-mer oligonucleotide as the reverse primer for PCR amplification and transmits it to Alice over a secure channel. Once the pair of PCR primers has been designed and exchanged over a secure communication channel, the encryption key K_A consists of the pair of PCR primers and Bob's public key e, and the decryption key K_B consists of the pair of PCR primers and Bob's secret key d.

B. Encryption

First, the sender Alice translates the plaintext M into hexadecimal code using the computer's built-in character code. The hexadecimal code is then translated into the binary plaintext M′ using third-party software. Next, Alice translates the binary plaintext M′ into the binary ciphertext C′ using Bob's public key e. We call this preprocessing operation data pre-treatment. Through this operation, completely different ciphertexts can be obtained from the same plaintext, which effectively prevents an attack that uses a guessed plaintext word as a PCR primer. Then, Alice translates the binary ciphertext C′ into a DNA sequence according to the DNA digital coding technology. After coding, Alice synthesizes the secret-message DNA sequence, which is flanked by the forward and reverse PCR primers, each a 20-mer oligonucleotide. The secret-message DNA sequence is thus prepared. In the last step of encryption, Alice generates a certain number of dummies and puts the secret-message DNA sequence among them. Each dummy must have the same structure as the secret-message DNA sequence. In this scheme, the dummies are generated by sonicating human DNA to an average size of roughly 60 to 160 nucleotide pairs and denaturing it. After mixing the secret-message DNA sequence with the dummies, Alice sends the DNA mixture to Bob over an open communication channel.

C. Decryption

After the intended receiver Bob gets the DNA mixture, he can easily find the secret-message DNA sequence. Since Bob obtained the correct pair of PCR primers in a secure way, he can amplify the secret-message DNA sequence by performing PCR on the DNA mixture. After amplifying the secret-message DNA sequence, Bob retrieves the plaintext M sent by Alice by applying the reverse of the preprocessing operation with his secret key d. This decryption process is thus not only a mathematical computation but also a biological process. The data pretreatment flow chart is shown in Fig. 1.12.1.

Fig.1.12.1 Data pre(post)treatment flow chart

In the following part of this section, we thoroughly discuss details of this encryption scheme

with an example shown in fig. 1.12.2. The result of the PCR amplification is shown in fig.

1.12.3.

Step 1: Key generation. The message sender Alice and the message receiver Bob each design one primer of a PCR primer pair and exchange the primers over a secure communication channel. The encryption and decryption keys both contain this pair of PCR primers. In this scheme, the primer pair is not designed independently by the sender or the receiver alone, but cooperatively by both. This increases the security of the encryption scheme: even if an adversary somehow obtains one primer of the pair, amplification is not efficient when the other primer is incorrect. Only when both primer sequences are correct can the amplification succeed.

Step 2: Data pretreatment. Here we choose “GENECRYPTOGRAPHY” (gene cryptography) as the plaintext to encrypt. We first convert this string into hexadecimal code using the computer's built-in character code (ASCII), that is: “47 45 4E 45 43 52 59 50 54 4F 47 52 41 50 48 59”. Then we translate the hexadecimal code into the binary plaintext M′ using third-party software, that is:

01000111 01000101 01001110 01000101

01000011 01010010 01011001 01010000

01010100 01001111 01000111 01010010

01000001 01010000 01001000 01011001
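The conversion in this step does not depend on any particular tool. A minimal sketch using only the standard library, assuming ASCII as the character code and with illustrative function names, reproduces the hexadecimal and binary strings above:

```python
def to_hex(text: str) -> str:
    """Hexadecimal ASCII codes separated by spaces."""
    return " ".join(f"{byte:02X}" for byte in text.encode("ascii"))


def to_binary(text: str) -> str:
    """Eight-bit binary ASCII codes separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def from_binary(bits: str) -> str:
    """Inverse of to_binary (the data post-treatment direction)."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


plaintext = "GENECRYPTOGRAPHY"
print(to_hex(plaintext))
print(to_binary(plaintext))
```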

Fig. 1.12.2. Flow chart of Encryption scheme system.

Fig. 1.12.3. Result of the PCR amplification

Step 3: Encryption. Alice encrypts the binary plaintext M′ into the binary ciphertext C′ using Bob's public key e. She then converts the binary ciphertext C′ into a DNA sequence using the DNA digital coding technology. Finally, she synthesizes a secret-message DNA sequence containing an encoded message 64 nucleotides long, flanked by the forward and reverse PCR primers. The secret-message DNA is thus prepared. After mixing the secret-message DNA sequence with a certain number of dummies, Alice sends the DNA mixture to Bob over an open communication channel, such as DNA ink or a DNA book.

Step 4: Decryption. After the intended receiver Bob gets the DNA mixture, he can easily pick out the secret-message DNA sequence using the correct primer pair. Bob translates the secret-message DNA sequence into the binary ciphertext C′ using the DNA digital coding technology. Then Bob decrypts the binary ciphertext C′ into the binary plaintext M′ using his secret key d.

Step 5: Data post-treatment. After the binary plaintext M′ has been recovered, Bob retrieves the plaintext M, “GENECRYPTOGRAPHY”, from the binary plaintext M′ by data post-treatment.
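The biological part of the five steps can be mimicked in software to show the data flow. The sketch below is a simplification under several assumptions: the RSA pre-treatment is deliberately left out, dummies are random strings rather than sonicated human DNA, and PCR is replaced by a string match on the two primers. All names are hypothetical.

```python
import random

PATTERN = "CTAG"  # 0123/CTAG digital coding: C=00, T=01, A=10, G=11
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def text_to_dna(text: str) -> str:
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return "".join(PATTERN[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def dna_to_text(dna: str) -> str:
    bits = "".join(f"{PATTERN.index(base):02b}" for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("ascii")


def random_dna(rng: random.Random, length: int) -> str:
    return "".join(rng.choice("ACGT") for _ in range(length))


def hide(payload: str, fwd: str, rev: str, rng: random.Random, n_dummies: int = 1000) -> list:
    """Flank the payload with both primers and shuffle it among random dummies."""
    secret = fwd + payload + revcomp(rev)
    pool = [random_dna(rng, len(secret)) for _ in range(n_dummies)] + [secret]
    rng.shuffle(pool)
    return pool


def amplify(pool: list, fwd: str, rev: str) -> list:
    """Stand-in for PCR: keep strands flanked by both primers, then strip the primers."""
    tail = revcomp(rev)
    return [s[len(fwd):-len(tail)] for s in pool
            if s.startswith(fwd) and s.endswith(tail)]


rng = random.Random(0)
fwd, rev = random_dna(rng, 20), random_dna(rng, 20)   # 20-mer primers (the shared key)
pool = hide(text_to_dna("GENECRYPTOGRAPHY"), fwd, rev, rng)
recovered = [dna_to_text(payload) for payload in amplify(pool, fwd, rev)]
print(recovered)
```

Without the correct pair of primers, `amplify` returns nothing useful, which is the software analogue of the failed PCR amplification described above.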

1.13 The codes

The three codes described in detail in this paper are referred to as the Huffman code, the comma

code and the alternating code. It should be stated at the outset that none of them fulfill all the

criteria listed above. The Huffman code is the most economical and would be the best for

encrypting text for short-term storage, providing that this text lacked any sort of punctuation,

symbols or numbers. Both the comma code and the alternating code, while the most

uneconomical of the codes, have the advantage that they generate base sequences which are

obviously artificial, and so would be best suited to the encryption of information for long-term

storage.

1.13.1 The Huffman code

By varying the number of symbols allotted to a character in a code, with the most frequent

character being given the least number of symbols and the least frequent the most number of

symbols, it is possible to construct very economical codes, i.e. codes in which the text is

encrypted by the minimum number of symbols – it is as short as it can possibly be. One of the

best ways of constructing an economical code is to use Huffman’s method (Huffman 1952). As

well as being compact, the message generated by a Huffman code is unambiguous. That is, once

the start point has been specified, there is only one way in which the stream of symbols

comprising the message can be read. The Huffman code constructed with the four DNA bases A,

G, C and T for the letters of the English alphabet is shown in Table 1.13.1. Given the frequencies

of occurrence of these letters, such a code is straightforward to construct (Materials and

methods). In the code, the shortest codon is just one base long (representing e, the most

frequently used letter in the English language), and the longest codon is five bases long

(representing q and z, the most infrequent letters in the English language). The average codon

length is 2.2 bases, shorter than the codons of any of the other codes described in this paper. The

unambiguous nature of the Huffman code shown in Table 1.13.1 can be seen by encoding any group

of letters with it and then decoding them from the beginning of the sequence: there is only one

way it can be done. For instance, the base sequence CATGTAGTCG can only be read from the

beginning as hester – no other interpretation of the message is possible. Given a suitable start

signal, the alternating code is also unambiguous. Although, of the three codes discussed here, the

Huffman code makes the most economical use of DNA, it does have two disadvantages. The first is

that it does not cater for any symbols or numbers, as the frequency of these characters will be

heavily text-dependent. Consequently they cannot be included when deriving the Huffman code.

The second disadvantage of the Huffman code relates to its possible use in long-term storage of

information. Because of the variable length of the codons, no obvious pattern emerges when

they are joined together to encode a message. The naive investigator might confuse it with

natural DNA and therefore not appreciate its significance. One could counteract this problem by

using three instead of four bases (e.g. A, C and T), at the expense of economy. The Huffman

code is the only code discussed in this paper with variable length codons. The others all have

fixed length codons. We note that, in a similar manner to the above, the Huffman code has also

been used to construct a ‘perfect’ genetic code comprising variable length codons.

Table 1.13.1 The Huffman code
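To make the construction concrete, the following sketch builds a four-symbol Huffman code over the bases A, G, C and T. The letter frequencies are approximate, commonly quoted percentages for English and are an assumption of this example, so the resulting codons need not coincide with those of Table 1.13.1. Zero-weight placeholder symbols are added so that every merge combines exactly four nodes, a standard device for Huffman codes with more than two symbols.

```python
import heapq
from itertools import count

# Approximate English letter frequencies in percent (illustrative values).
LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def huffman_dna(freqs: dict, alphabet: str = "AGCT") -> dict:
    """Return a prefix-free mapping from characters to variable-length codons."""
    k = len(alphabet)
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(weight, next(tie), {sym: ""}) for sym, weight in freqs.items()]
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append((0.0, next(tie), {}))  # zero-weight placeholder
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0.0
        for base in alphabet:  # merge the k lightest nodes under a new parent
            weight, _, codes = heapq.heappop(heap)
            total += weight
            for sym, suffix in codes.items():
                merged[sym] = base + suffix
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]


def encode(text: str, code: dict) -> str:
    return "".join(code[ch] for ch in text)


def decode(dna: str, code: dict) -> str:
    inverse = {codon: ch for ch, codon in code.items()}
    out, buffer = [], ""
    for base in dna:
        buffer += base
        if buffer in inverse:  # prefix-free: the first match is the only match
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)


code = huffman_dna(LETTER_FREQ)
message = encode("hester", code)
print(message, decode(message, code))
```

Because no codon is a prefix of another, decoding from the start of the sequence is unambiguous, as described for the Huffman code above.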

1.13.2 The comma code

In the comma code, consecutive 5-base codons are separated by a single base, the comma, which

is always the same, e.g. G-----G-----G-----G. The repetition of G every six

bases must be construed by any careful sequence analyst as a deliberate device. The codons that

slot into the gaps in the above framework are made up of the remaining bases C, A and T, but

not G, e.g. ATCAC. These codons are further restricted to three A:T base pairs and two G:C

base pairs, with the C of the latter always being located in the top strand. This kind of an

arrangement, suggested by unrelated work, has the advantage that it will generate a set of codons

with isothermal melting temperatures, facilitating the construction of message DNA (‘Criteria

for an optimal code’, above). The codons take the general form CWWWC, where W = A or T,

and the C’s and W’s can adopt any arrangement (e.g. WWCWC or WCCWW). There are 80

codons in this set. Most (83%) point mutations give nonsense codons, and therefore the comma

code is good at detecting errors. But the principal attraction of the comma code is the reading

frame established by the regular pattern of repeating G’s. The other two codes described do not

have this advantage, and, unless a start point is specified, it might be difficult to orientate oneself

with respect to the message. With the comma code, the reading frame is clear. Furthermore, it

offers some protection against deletion and insertion mutations, which could further complicate

the interpretation of the other codes. For example, it is not difficult to spot the codon containing

the deletion mutation in the following comma-coded sequence:

GATCACGATTCCGCTATGACTCAG. It should also be noted that the base composition of

the codons will give, when the commas are included, message DNA with the unusual property

of a 1:1 ratio of A:T to G:C base pairs.
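The way the commas expose a deletion can be sketched in Java. This is only an illustrative sketch, not part of the original scheme: the class and method names (CommaCodeCheck, isSenseCodon, badCodons) are hypothetical, and the code assumes that the message starts and ends with the comma base G.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CommaCodeCheck {

    /** A sense codon has exactly five bases: two C and three A/T, and never G. */
    static boolean isSenseCodon(String codon) {
        if (codon.length() != 5) {
            return false;
        }
        int cytosines = 0;
        for (char base : codon.toCharArray()) {
            if (base == 'C') {
                cytosines++;
            } else if (base != 'A' && base != 'T') {
                return false;
            }
        }
        return cytosines == 2;
    }

    /**
     * Splits the message on the comma base G and returns the zero-based
     * positions of the pieces that are not sense codons.
     * Assumption: the message starts and ends with a comma.
     */
    static List<Integer> badCodons(String dna) {
        List<Integer> bad = new ArrayList<Integer>();
        // Limit -1 keeps the empty strings before the first and after the last comma.
        String[] pieces = dna.split("G", -1);
        for (int i = 1; i < pieces.length - 1; i++) {
            if (!isSenseCodon(pieces[i])) {
                bad.add(i - 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(badCodons("GATCACGATTCCGCTATGACTCAG")); // prints [2]
    }
}
```

For the sequence quoted above, the sketch reports position 2, i.e. the third codon, CTAT, which is one base short.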

1.13.3 The alternating code

The alternating code comprises sixty-four 6-base codons of alternating purines and pyrimidines:

RYRYRY, where R = A or G and Y = C or T (although there is no reason why the purines and

pyrimidines should not alternate YRYRYR, or be fixed in other arrangements such as YYYRRR

or RRYYRY). It is very unlikely that the alternating structure formed by strings of these codons

would go unremarked – even short stretches (8 base pairs) of alternating purines and

pyrimidines have been noted in naturally occurring DNA. As in the comma code, the alternating

structure has the unusual property that, in a given piece of message DNA, the number of G:C

pairs will be the same as the number of A:T pairs. As well as creating message DNA of an

obviously artificial nature, the alternating code has two other advantages of the comma code: it


is isothermal, and it is error-detecting, but less so than the comma code, since 67% of single

point mutations result in nonsense codons. Like the comma code it does not use DNA

economically. Unlike the comma code, there is no automatic reading frame.

In the comma code, three possible point mutations can occur at each position of the unit GCWWWC (which

includes the initial comma), and therefore there are 18 single point mutations altogether. Of

these 18 single point mutations, three (17%) will produce sense codons (mutation of an A to a T,

or vice versa) and therefore the remaining 83% of single point mutations will give nonsense

codons.
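The figures quoted for the comma code (83% nonsense) and the alternating code (67% nonsense) can be checked by brute-force enumeration. The Java sketch below is an illustration under stated assumptions: the class and method names are hypothetical, and GCATAC and ACGTAC are arbitrary members of the two codon sets chosen only as examples.

```java
// Hypothetical helper, for illustration only.
final class PointMutationCount {

    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    /** Comma-code unit: the comma G followed by five bases with two C and three A/T. */
    static boolean isCommaSense(String unit) {
        if (unit.length() != 6 || unit.charAt(0) != 'G') {
            return false;
        }
        int cytosines = 0;
        for (int i = 1; i < 6; i++) {
            char base = unit.charAt(i);
            if (base == 'C') {
                cytosines++;
            } else if (base != 'A' && base != 'T') {
                return false;
            }
        }
        return cytosines == 2;
    }

    /** Alternating-code codon: RYRYRY, where R = A or G and Y = C or T. */
    static boolean isAlternatingSense(String codon) {
        if (codon.length() != 6) {
            return false;
        }
        for (int i = 0; i < 6; i++) {
            char base = codon.charAt(i);
            boolean purine = base == 'A' || base == 'G';
            boolean pyrimidine = base == 'C' || base == 'T';
            if (i % 2 == 0 ? !purine : !pyrimidine) {
                return false;
            }
        }
        return true;
    }

    /** Counts the single-base substitutions of s that still give a sense codon. */
    static int senseMutations(String s, boolean commaCode) {
        int sense = 0;
        for (int i = 0; i < s.length(); i++) {
            for (char base : BASES) {
                if (base == s.charAt(i)) {
                    continue; // not a mutation
                }
                String mutant = s.substring(0, i) + base + s.substring(i + 1);
                if (commaCode ? isCommaSense(mutant) : isAlternatingSense(mutant)) {
                    sense++;
                }
            }
        }
        return sense;
    }

    public static void main(String[] args) {
        System.out.println(senseMutations("GCATAC", true) + " of 18");  // prints 3 of 18
        System.out.println(senseMutations("ACGTAC", false) + " of 18"); // prints 6 of 18
    }
}
```

Three sense mutations out of 18 leave 15 nonsense mutations (83%) for the comma code, and six out of 18 leave 12 (67%) for the alternating code, in agreement with the text.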

Table 1.13.2 General features of the codes

Table 1.13.3 Advantages of the codes

1.13.4 Other codes

The three codes detailed above are meant to be illustrative rather than exhaustive. They are by

no means the only codes, or the only types of code, possible. Three others are outlined briefly in

this section. Before experimental data for the nature of the genetic code became available, there

were a number of suggestions as to what form it might take. One of these was the comma-free

code. As the name suggests, a comma-free code is just a comma code without the commas. One

might think that removing the commas would give a code without a reading frame. But, by

restricting oneself to a set of fixed-length codons with particular base combinations, the codons


in this set can be chosen such that only one reading frame is ever possible – all the others give

nonsense. For instance, the 3-base codons AGG, ACG and GTG are part of a comma-free code.

Any combination of these codons will give a sequence which can be read in only one way. For

example, in the sequence ACGGTGGTGACGAGG, one could not begin reading one base in, at

CGG, because CGG does not belong to the set. The original paper on the subject showed

that twenty 3-base codons could be selected to act in a comma-free manner. Although twenty

codons is not sufficient to comfortably encrypt text, there is a set of fifty-seven 4-base codons

that would be enough to carry out this task. There is nothing particularly wrong with the comma-

free code as a message-encoding scheme. In fact, since it is quite economical and establishes an

automatic reading frame, it ought to be rather good. However, the only significant clue to the

synthetic nature of message DNA containing text encrypted with a comma-free code would be

the absence of runs of four identical bases (e.g. AAAA), as the comma-free code forbids these.

There are no such absences in natural DNA. Like the alternating and comma codes, the comma-

free code would be error-detecting to a certain extent. One other simple code that should also be

mentioned, because it produces DNA that is obviously artificial, is one that uses only three

of the four different bases, in a similar manner to the codons of the comma code. In fact,

message DNA has already been constructed with a 3-base codon version of this code. We would

probably use a 4-base codon version of this code, however, to give a larger codon set (3^4 = 81 as

opposed to 3^3 = 27). Finally, perhaps the most obvious code of all is one similar to the genetic

code – a triplet code. Codon assignment in this case may be done in a non-random fashion, such

that a degree of error-protection could be achieved, with error-correcting codons representing

symbols with opposite meaning (e.g. CTT to encode for ’<’ and AAG for ’>’).
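The comma-free property described in this section can be tested mechanically: a codon set is comma-free if, for every ordered pair of codons written side by side, none of the out-of-frame windows is itself a codon. The Java sketch below checks this for the three example codons AGG, ACG and GTG. It is a minimal illustration; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class CommaFreeCheck {

    /**
     * Returns true if no out-of-frame window spanning two adjacent codons
     * is itself a member of the codon set.
     * Assumption: every codon in the set has the given length.
     */
    static boolean isCommaFree(Set<String> codons, int length) {
        for (String first : codons) {
            for (String second : codons) {
                String pair = first + second;
                // Shifts 1 .. length-1 are the wrong reading frames.
                for (int shift = 1; shift < length; shift++) {
                    if (codons.contains(pair.substring(shift, shift + length))) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> example = new HashSet<String>(Arrays.asList("AGG", "ACG", "GTG"));
        System.out.println(isCommaFree(example, 3)); // prints true

        // Counter-example: ACG followed by ACG contains CGA one base in.
        Set<String> clash = new HashSet<String>(Arrays.asList("ACG", "CGA"));
        System.out.println(isCommaFree(clash, 3)); // prints false
    }
}
```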


Chapter 2

Objective


2.1 Objective

The aim of our project is to build a system which fulfills the following objectives:

To implement the basic concepts of DNA cryptography.

To hide the biological complexity involved in the basic processing

of DNA cryptography.

To allow users to apply the encoding to textual information.

To obtain an encoded text as desired.

Although many encoding techniques are available in the market, this project aims at

understanding the limitations and configurations needed to implement a new technique (DNA

cryptography) for encoding text.

2.2 Product Perspective

The main purpose or goal of the project is to implement the basic fundamentals of DNA

Cryptography using the Java platform so as to produce an encoding tool capable of applying the

elementary encoding transformations to the text. In addition, the project aims to develop a clear

understanding of Java cryptography and its native API.

Chapter 3

System Requirement Analysis


3 System Requirement Analysis:

3.1 Characteristics

The important characteristics of the system being developed:

FUNCTIONS

~ Loading the text file from source.

~ Encoding the text using DNA cryptography and PCR

amplifications.

INPUT

~ User input text file for encoder

~ Encoded file for the decoder

OUTPUT

~ A Transformed encoded text for sending to decoder

~ Original text file at decoder

3.2 System Requirements

The following requirements must be fulfilled to run the software on any computer system.

HARDWARE SPECIFICATIONS

Processor    Intel Pentium III or higher

Monitor      Color monitor, 800 x 600 or higher resolution

Amplifier    PCR (Polymerase Chain Reaction) amplifier


SOFTWARE SUPPORT

Operating System    Windows 9x / XP / NT / 2000, with JVM and JRE installed

Framework           NetBeans 6.0

3.3 Technology Used

Programming Language JAVA 5

3.4 Use Case Diagram

3.4.1 Encoder

Fig 3.4.1 Use case diagram (encoder)

3.4.2 Decoder

Fig 3.4.2 Use case diagram (decoder)

Chapter 4

Project Overview

4 Project Overview

Fig 4.1 Project overview

The above figure shows the basic components comprising a typical general-purpose system used

for DNA cryptography. The function of each component is described below.

The computer is a general computer that can range from a PC to a supercomputer. In dedicated

applications sometimes specialized computers are used to achieve the desired level of

performance.

Text File is the user input that has to be encoded.

PCR Amplifier is the hardware component that will be used for converting the text into a

graphical format, which reduces the space consumed. It consists of specialized modules that

perform specific tasks.

[Fig 4.1 components: Computer, Text File, PCR Amplifier, Network (To Receiver)]

Chapter 5

Software Design

5 Software Design

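5.1 Methodology of Encryption Scheme

The encryption process is:

$C.T. = E_{K_A}(P.T.)$

The decryption process is:

$D_{K_B}(C.T.) = D_{K_B}(E_{K_A}(P.T.)) = P.T.$

Here P.T. is the plaintext, C.T. is the ciphertext, $E_{K_A}$ denotes encryption under the key $K_A$, and

$D_{K_B}$ denotes decryption under the key $K_B$.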

STEPS:

1. Key generation

2. Encryption

3. Decryption
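The three steps, and the requirement that decryption undoes encryption, can be illustrated with a toy Java sketch. It is not the scheme implemented in this project. It is a deliberately simple stand-in that XORs the text with a key and writes each byte as four bases. The class name ToyDnaCipher, its methods and the mapping A = 00, C = 01, G = 10, T = 11 are all assumptions made for illustration, and the same key is used for both directions (that is, K_A = K_B here).

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Toy illustration only; a repeating XOR key is NOT a secure cipher.
final class ToyDnaCipher {

    // Assumed 2-bit mapping: A = 00, C = 01, G = 10, T = 11.
    private static final String BASES = "ACGT";

    /** Step 1: key generation (random bytes; the key length is the caller's choice). */
    static byte[] generateKey(int length) {
        byte[] key = new byte[length];
        new SecureRandom().nextBytes(key);
        return key;
    }

    /** Step 2: XOR each byte with the key, then emit four bases per byte. */
    static String encrypt(String plainText, byte[] key) {
        byte[] data = plainText.getBytes(StandardCharsets.UTF_8);
        StringBuilder dna = new StringBuilder(data.length * 4);
        for (int i = 0; i < data.length; i++) {
            int masked = (data[i] ^ key[i % key.length]) & 0xFF;
            for (int shift = 6; shift >= 0; shift -= 2) {
                dna.append(BASES.charAt((masked >> shift) & 3));
            }
        }
        return dna.toString();
    }

    /** Step 3: rebuild each byte from four bases, then undo the XOR. */
    static String decrypt(String dna, byte[] key) {
        if (dna.length() % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        byte[] data = new byte[dna.length() / 4];
        for (int i = 0; i < data.length; i++) {
            int value = 0;
            for (int j = 0; j < 4; j++) {
                int bits = BASES.indexOf(dna.charAt(4 * i + j));
                if (bits < 0) {
                    throw new IllegalArgumentException("not a DNA base");
                }
                value = (value << 2) | bits;
            }
            data[i] = (byte) (value ^ key[i % key.length]);
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = generateKey(8);
        String cipherText = encrypt("DNA", key);
        System.out.println(cipherText);               // 12 bases; content depends on the random key
        System.out.println(decrypt(cipherText, key)); // prints DNA
    }
}
```

The round trip decrypt(encrypt(P.T.)) = P.T. holds because XOR with the same key byte is its own inverse and the 2-bit base mapping is one-to-one.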

5.2 Flow Diagrams

5.2.1 Encoder

Fig 5.2.1 Flow diagram (encoder)

5.2.2 Decoder

Fig 5.2.2 Flow diagram (decoder)

5.3 Class Diagrams

5.3.1 Encoder

Fig 5.3.1 Class diagram (encoder)

5.3.2 Decoder

Fig 5.3.2 Class diagram (decoder)

5.3.3 KeyGen

Fig 5.3.3 Class diagram (KeyGen)

Chapter 6

Software Testing

6 Testing

6.1 Testing Methodology

Software testing is a critical element of software quality assurance and represents the ultimate

review of specification, design and coding. It is used to detect errors. Testing is a dynamic

method for verification and validation, where the system to be tested is executed and the

behavior of the system is observed.

6.2 Testing Objectives

1. Testing is a process of executing a program with the intent of finding an error.

2. A good test case is one that has a high probability of finding an as-yet-

undiscovered error.

3. A successful test is one that uncovers an as-yet-undiscovered error.

The above objectives imply a dramatic change in viewpoint. They move counter

to the commonly held view that a successful test is one in which no errors are

found. Our objective is to design tests that systematically uncover different

classes of errors and do so with a minimum amount of time and effort.

6.3 Testing Technique

The techniques followed throughout the testing of the system are as follows:

6.3.1 Black-Box Testing

Black-box testing focuses on the functional requirements of the software. That is, black-box

testing enables the software engineer to derive sets of input conditions that will fully exercise all

functional requirements for a program. Black-box testing is not an alternative to white-box

techniques. Rather, it is a complementary approach that is likely to uncover a different class of

errors than white-box methods. Black-box testing attempts to find errors in the following

categories:

Incorrect or missing functions.


Interface errors.

Errors in data structures or external database access.

Performance errors.

Initialization and termination errors.

Unlike White Box Testing, which is performed early in the testing process, Black Box Testing

tends to be applied during later stages of testing. Because Black Box Testing purposely

disregards control structure, attention is focused on the information domain. Tests are designed

to answer the following questions:

How is functional validity tested?

What classes of input will make good test cases?

Is the system particularly sensitive to certain input values?

How are the boundaries of a data class isolated?

What data rates and data volume can the system tolerate?

What effect will specific combinations of data have on system operation?

By applying black box techniques, we derive a set of test cases that satisfy the following criteria:

Test cases that reduce, by a count that is greater than one, the number of

additional test cases that must be designed to achieve reasonable testing, and

Test cases that tell us something about the presence or absence of classes of

errors, rather than errors associated only with the specific test at hand.

6.3.2 White-Box Testing

White-box testing uses knowledge of the internal workings of a product: tests can be conducted to ensure

that internal operations are performed according to specifications and all internal components

have been adequately exercised.

Using white-box testing methods, test cases can be derived that:

Exercise all independent paths within a module at least once.

Exercise all logical decisions on their true and false sides.


Execute all loops at their boundaries and within their operational bounds.

Exercise internal data structures to ensure their validity.

6.3.3 Control Structure Testing

6.3.3.1 Condition Testing

Condition testing is a test case design method that exercises the logical conditions

contained in a program module. If a condition is incorrect, then at least one component of the

condition is incorrect. Therefore, the types of errors in a condition include the following:

Boolean operator error

Boolean variable error

Boolean parenthesis error

Relational operator error

Arithmetic expression error

6.3.3.2 Loop Testing

Loops are the cornerstone of the vast majority of algorithms implemented in software. Loop

testing is a white-box testing technique that focuses exclusively on the validity of loop

constructs. Four different classes of loops can be defined:

Simple Loops

Nested Loops

Concatenated Loops

Unstructured Loops

6.3.3.3 Dataflow Testing

The dataflow testing method selects test paths of a program according to the location of

definitions and uses of variables in the program. In this testing approach, assume that each

statement in a program is assigned a unique statement number and that each function does not

modify its parameters or global variables.

It is useful for selecting test paths of a program containing nested if and loop statements. This

approach is effective for error detection. However, the problems of measuring test coverage and


selecting test paths for data flow testing are more difficult than the corresponding problems for

condition testing.

6.4 Testing Strategies

A strategy for software testing integrates software test case design methods into a well planned

series of steps that result in the successful construction of software. A software testing strategy

should be flexible enough to promote a customized testing approach.

6.4.1 Unit Testing

Unit testing focuses verification efforts on the smallest unit of software design. It is white box

oriented. Unit testing is essentially for verification of the code produced during the coding phase,

and hence the goal is to test the internal logic of the module. A module is considered for

integration and use by others only after it has been unit tested satisfactorily.

The module interface is tested to ensure that information properly flows in and

out of program.

Local data structure is examined to ensure that data stored temporarily maintains

its integrity.

Boundary conditions are tested to ensure that modules operate properly at

boundary limits of processing.

All independent paths are exercised to ensure all statements in a module have

been executed at least once.

All error-handling paths are tested.

6.4.2 Integration Testing

Integration testing focuses on the design and construction of the software architecture. We

followed a systematic technique for constructing the program structure, putting the modules

together while at the same time conducting tests to uncover errors associated with interfacing.

We took unit-tested components and built the program structure dictated by the design.


6.4.3 Validation Testing

Validation is achieved through a series of black-box tests. An important element of the validation process is

configuration review, which is intended to ensure that all the elements are properly configured and cataloged. It

is also called an audit.

6.4.4 System Testing

The last high-order testing step falls outside the boundary of software engineering and into the

broader context of computer system engineering. Software, once validated, must be combined

with other system elements (e.g., hardware, people, and databases). System testing verifies that all

elements mesh properly and that overall system function/performance is achieved.

It is a series of different tests whose primary purpose is to fully exercise the computer-based

system. Although each test has a different purpose all work to verify that system elements have

been properly integrated and perform allocated functions.

Chapter 7

Project Snapshots

7.1 Text file

Fig 7.1 Snapshot (original text)

7.2 Encoded file

Fig 7.2 Snapshot (encoded text)

7.3 Decoded file

Fig 7.3 Snapshot (decoded text)

Chapter 8

Conclusion

8 Conclusion

The main purpose of the project was to study and implement the basic fundamentals of

DNA cryptography on textual information. The project provides an insight into the various

details of DNA and its use for cryptographic purposes. It also provided us with an

opportunity to analyse and practice all the phases of the Software Development Life Cycle.

Chapter 9

Future Prospects & Enhancements

9 Future Prospects and Enhancements

This project can be extended to encrypt other data formats.

The space complexity can be reduced by practical usage of a PCR amplifier.

Ongoing research could be used for the future enhancement of this project.

DNA cryptography can be used to prevent cyber crimes such as hacking, and to provide

a secure channel for communication.


APPENDIX

Abbreviations Full Forms

DNA Deoxyribonucleic Acid

RNA Ribonucleic Acid

PCR Polymerase Chain Reaction

C Cytosine

T Thymine

A Adenine

G Guanine

U Uracil

mRNA Messenger Ribonucleic Acid

tRNA Transfer Ribonucleic Acid


Bibliography

Books & Literature

[1] Herbert Schildt, JAVA2 Complete Reference, Fifth Edition, Tata McGraw-Hill

Publishing Company Limited, 2004

[2] Scott W. Amber, JAVA2 Enterprise Edition 1.4 Bible, Wiley Publishing Inc., 2003

[3] Java 5.0 API Documentation

Websites

[4] Hodorogea Tatiana, Vaida Mircea-Florin , Borda Monica, Streletchi Cosmin,

A Java Crypto Implementation of DNAProvider Featuring Complexity in Theory and

Practice, IEEE 2008

[5] Sherif T. Amin , Magdy Saeb , Salah El-Gindi,

A DNA-based Implementation of YAEA Encryption Algorithm

[6] Guangzhao Cui , Limin Qin , Yanfeng Wang , Xuncai Zhang

An Encryption Scheme Using DNA Technology, IEEE 2008

[7] Ning Kang, A Pseudo DNA Cryptography Method

[8] Geoff C. Smith, Ceridwyn C. Fiddes, Jonathan P. Hawkins & Jonathan P.L. Cox,

Some possible codes for encrypting data in DNA,

Biotechnology Letters 25: 1125–1130, 2003.
