VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“Jnana Sangama”, Belagavi – 590 018

A PROJECT REPORT ON

“ANALYSIS OF DIFFERENT STRING MATCHING

ALGORITHMS”

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

BY

BHUPALAM SNEHA (1NH12CS714)

DIVYA PEDDI REDDY (1NH12CS721)

NIHARIKA K (1NH12CS736)

SRUTHI VEGI (1NH12CS757)

Under the guidance of Ms. DEEPIKA.N

(Senior Assistant Professor, Dept. of CSE, NHCE)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

NEW HORIZON COLLEGE OF ENGINEERING

(ISO-9001:2000 certified, Accredited by NAAC ‘A’,

Permanently affiliated to VTU)

Outer Ring Road, Panathur Post, Near Marathalli,

Bangalore – 560103

NEW HORIZON COLLEGE OF ENGINEERING

(ISO-9001:2000 certified, Accredited by NAAC ‘A’,

Permanently affiliated to VTU)

Outer Ring Road, Panathur Post, Near Marathalli,

Bangalore-560 103

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

Certified that the project work entitled “ANALYSIS OF DIFFERENT STRING

MATCHING ALGORITHMS” carried out by BHUPALAM SNEHA (1NH12CS714),

DIVYA PEDDI REDDY (1NH12CS721), NIHARIKA K (1NH12CS736) and SRUTHI

VEGI (1NH12CS757) bonafide students of NEW HORIZON COLLEGE OF

ENGINEERING in partial fulfillment for the award of Bachelor Of Engineering in

Computer Science and Engineering of the Visvesvaraya Technological University,

Belgaum during the year 2015-2016. It is certified that all corrections/suggestions indicated

for Internal Assessment have been incorporated in the report deposited in the department

library. The project report has been approved as it satisfies the academic requirements in

respect of Project work prescribed for the said Degree.

Name & Signature of Guide Name Signature of HOD Signature of Principal

(Ms.DEEPIKA.N) (Dr. Prashanth C.S.R.) (Dr. Manjunatha)

External Viva

Name of Examiner Signature with date

1.

2.

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the people who made it possible, whose constant guidance and encouragement crowned our efforts with success.

We thank the management, Dr. Mohan Manghnani, Chairman of NEW HORIZON EDUCATIONAL INSTITUTIONS, for providing the necessary infrastructure and creating a good environment.

We also record here the constant encouragement and facilities extended to us by Dr.

Manjunatha, Principal, NHCE and Dr. Prashanth.C.S.R, Dean Academics, Head of the

Department of Computer Science and Engineering. We extend our sincere gratitude to

them.

We express our gratitude to Ms. DEEPIKA.N, our project guide, for constantly monitoring the development of the project and setting precise deadlines. Her valuable suggestions were the motivating factors in completing the work.

We would also like to express our gratitude to NHCE and to all our external guides

at NHCE for their continuous guidance and motivation.

Finally, a note of thanks to the teaching and non-teaching staff of the Computer Science and Engineering Department for the cooperation extended to us, and to our friends, who helped us directly or indirectly in the course of the project work.

BHUPALAM SNEHA (1NH12CS714)

DIVYA PEDDI REDDY (1NH12CS721)

NIHARIKA K (1NH12CS736)

SRUTHI VEGI (1NH12CS757)

ABSTRACT

String matching is the problem of finding all occurrences of a character pattern in a text. In this project, we have analyzed several algorithms: the naive string matching algorithm, Rabin-Karp, Knuth-Morris-Pratt and Finite Automata. We analyzed the core ideas of these single-pattern string matching algorithms and of multi-pattern string matching algorithms. We compared the matching efficiency of these algorithms by searching speed, pre-processing time, matching time and the key ideas used in each.

The applicability of the various string matching algorithms is also described, identifying the most suitable algorithm for activities in which string matching is an important part of the functionality. In all of these applications, the text string must be matched against the pattern.

CONTENTS

1. INTRODUCTION
1.1 ABSTRACT
1.2 PROBLEM DEFINITION
1.3 PROJECT PURPOSE
1.4 PROJECT FEATURES

2. LITERATURE SURVEY
2.1 STRING MATCHING
2.2 DIFFERENT STRING MATCHING ALGORITHMS
2.3 SOFTWARE DESCRIPTION

3. REQUIREMENT ANALYSIS
3.1 FUNCTIONAL REQUIREMENTS
3.2 HARDWARE REQUIREMENTS
3.3 SOFTWARE REQUIREMENTS

4. DESIGN
4.1 DESIGN GOALS
4.2 ALGORITHM TECHNIQUES
4.3 ALGORITHMS
4.4 GRAPH SNIPPET

5. IMPLEMENTATION
5.1 DATASET
5.2 GRAPHICAL USER INTERFACE

6. TESTING
6.1 UNIT TESTING
6.2 INTEGRATION TESTING
6.3 VALIDATION TESTING
6.4 SYSTEM TESTING
6.5 TESTING OF INITIALIZATION AND UI COMPONENTS

7. SNAPSHOT
7.1 OUTPUT OF NAÏVE ALGORITHM
7.2 OUTPUT OF KNUTH-MORRIS-PRATT ALGORITHM
7.3 OUTPUT OF FINITE AUTOMATA ALGORITHM
7.4 OUTPUT OF RABIN-KARP ALGORITHM
7.5 OUTPUT OF ALGORITHMS COMPARISON PARAMETERS

8. CONCLUSION AND FUTURE ENHANCEMENT
8.1 CONCLUSION
8.2 FUTURE ENHANCEMENT

REFERENCES

LIST OF FIGURES

Fig 2.1 EXAMPLE FOR STRING MATCHING ALGORITHM
Fig 2.2 JAVA PERSPECTIVE
Fig 4.1 APPLICATION OF PATTERN MATCHING
Fig 5.1 BLOCK DIAGRAM TO SHOW READER AND WRITER FUNCTION
Fig 5.2 BLOCK DIAGRAM TO SHOW JCOMBO BOX
Fig 6.1 THE TESTING PROCESS

LIST OF TABLES

Table 3.1 EXECUTION TIME OF ALGORITHMS
Table 3.2 PRE-PROCESSING TIME
Table 4.1 ALGORITHM NOTATION
Table 4.2 ALGORITHM TECHNIQUES
Table 6.1 TEST CASE WHEN INPUT IS TAKEN FROM DATA SET
Table 6.2 TEST CASE WHEN INPUT GIVEN BY USER

Analysis of Different String Matching Algorithms

Dept of CSE,NHCE 1

CHAPTER 1

INTRODUCTION

1.1 ABSTRACT

A string matching algorithm aims to find one or several occurrences of a string within another. The algorithm returns the position of the first character of the desired substring in the text. There are many different solutions to this problem; this project presents four of the best-known string matching algorithms: Naive, Knuth-Morris-Pratt, Finite Automata and Rabin-Karp. The results show that Finite Automata is the most effective algorithm for solving the string matching problem in usual cases, and that Rabin-Karp is a good alternative in some specific cases, for example when the pattern and the alphabet are very small.

1.2 PROBLEM DEFINITION

The string matching problem can be formulated as follows: the pattern to be searched for is an array P[1..m] of length m, and the text (document) is an array T[1..n] of length n, with m ≤ n. The elements of P and T are characters belonging to a finite alphabet Σ (for example, Σ = {a, b, ..., z}). The problem is to find all shifts s ∈ [0, n-m] such that T[s + i] = P[i] for all i ∈ [1, m].

1.3 PROJECT PURPOSE

The purpose of this project is to study different algorithms for the string matching problem. These algorithms are used to find one, several or all occurrences of a defined string (the pattern) in a larger string (typically a text). The string matching problem has many applications in multiple areas. First, a well-adapted and efficient algorithm for this problem can help improve the responsiveness of text-editing software. Other applications in information technology include web search engines, spam filters, natural language processing, computational biology (searching for a particular pattern in a DNA sequence) and feature detection in digital images.

There are different solutions to the string matching problem. First, we have the naive algorithm, the simplest one, which tries to match the pattern against each substring of the same length in the text. Since the 1970s, several other algorithms, more sophisticated and more effective, have been invented. In 1977, Knuth, Morris and Pratt published the first algorithm that preprocesses the pattern to obtain better performance: the Knuth-Morris-Pratt algorithm. In 1987, Karp and Rabin proposed an algorithm based on a completely different approach: the Rabin-Karp algorithm, which computes a hash function for the pattern and then looks for a match by applying the same hash function to each possible substring of the same length in the text.

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA), also known as a deterministic finite accepter or deterministic finite state machine, is a finite state machine that accepts or rejects finite strings of symbols and produces a unique computation (or run) of the automaton for each input string. 'Deterministic' refers to the uniqueness of the computation. In searching for the simplest models to capture finite state machines, McCulloch and Pitts were among the first researchers to introduce a concept similar to the finite automaton, in 1943. In this project,

we present the four algorithms mentioned above. The final goal is to compare these algorithms. To achieve this, we implement, test and compare the complexity, execution time, pre-processing time, lines of code and number of comparisons of each algorithm. The comparison is carried out in different situations: with small and large alphabets, and with patterns that appear zero times, once, a few times or many times in the text, depending on their length. We can observe and analyze the effectiveness of the algorithms by measuring their execution times under these different conditions.

We start with the simplest solution, the naive algorithm. Then we show three more sophisticated and more efficient solutions. After that, we show and compare the results obtained for each algorithm in the different cases considered. We observe that the Finite Automata algorithm is the best solution.

1.4 PROJECT FEATURES

Our project was developed to analyze pattern matching algorithms. In pattern matching there are two strings: the text T[1..n], which is the main string, and the pattern P[1..m], which is the string to be matched against the main string, with m ≤ n. We have chosen four different algorithms and analyzed them on a few parameters: execution time, number of comparisons, complexity, lines of code and pre-processing time. These parameters are used to identify the best and most efficient algorithm. We have also tried string matching with words, lines, paragraphs, etc., using a large data set. Hence our project works not just for small texts but also for relatively big data. From the analysis in our project, we found that the finite automata string matching algorithm works the fastest and is the most efficient.


CHAPTER 2

LITERATURE SURVEY

2.1 STRING MATCHING

In order to devise a survey of string matching algorithms, we observe the means used to answer two types of search models: (a) the pattern is a word (which depends on the language), and (b) the pattern is any sequence starting at an index point. For these models, the answer models are exact match and approximate match, respectively. In the remainder of this section we review the recent updated and hybrid algorithms. The exact string matching algorithms deal with finding all occurrences (not just some) of pattern P in text T. We classify exact string matching approaches by their character comparison methods, distinguishing between classical, deterministic finite automata, bit-parallelism and hashing string matching algorithms.

Classical Method: Classical string searching algorithms are based on character comparisons.

Fig 2.1 Example for string matching.


Naive Algorithm: This algorithm could be considered the simplest string matching

algorithm, since it performs character comparisons between the scanned text substring and

the complete pattern from left to right. In the case of a mismatch or a complete match it

shifts exactly one position to the right. It requires no preprocessing phase and no extra

space.

Knuth-Morris-Pratt (KMP) Algorithm, 1977: This algorithm searches for occurrences of a pattern P within a main text T from left to right. When a mismatch occurs, it uses the characters already matched to determine the largest shift of the pattern that cannot skip an occurrence, thus avoiding redundant comparisons.

Rabin-Karp Algorithm, 1987: R. Karp and M. Rabin published the randomized fingerprint method as a practical and efficient solution to the string-matching problem (Karp & Rabin, 1987). The randomized fingerprint method suits our purpose because it carries information forward from one comparison to the next, performs well in practice, and can be generalized to other related problems.

The Rabin-Karp algorithm uses modular arithmetic, Horner's rule and a number of other techniques to calculate a fingerprint (a decimal number) for each substring in a larger text file T. The algorithm first calculates the fingerprint of pattern P (denoted p). Then it iterates through every location in the text file T. At each iteration we are at a location (offset, position or shift) denoted s, and the algorithm calculates a fingerprint for the pattern-length substring beginning at s. If a substring's fingerprint is not equal to p, the substring definitely does not match the pattern, which makes the fingerprint a good filter for string matching; if the fingerprints are equal, the substring must still be compared with the pattern character by character, because different strings can share a fingerprint. Carrying the fingerprint forward from one location to the next also helps speed up the comparison process.
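The fingerprint idea described above can be sketched in Java as follows. This is a minimal illustration, not the implementation used in the project: the class name `RabinKarpSketch`, the base 256 and the prime modulus 1,000,000,007 are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative Rabin-Karp matcher; BASE and MOD are assumed values, not taken from the report. */
final class RabinKarpSketch {
    private static final long BASE = 256;           // assumed radix for Horner's rule
    private static final long MOD = 1_000_000_007L; // assumed large prime modulus

    /** Returns every 0-based shift s at which pattern occurs in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        long high = 1; // BASE^(m-1) mod MOD, the weight of the window's leading character
        for (int i = 1; i < m; i++) high = (high * BASE) % MOD;

        long p = 0, t = 0; // fingerprints of the pattern and of the current window
        for (int i = 0; i < m; i++) { // Horner's rule
            p = (p * BASE + pattern.charAt(i)) % MOD;
            t = (t * BASE + text.charAt(i)) % MOD;
        }

        for (int s = 0; s <= n - m; s++) {
            // Equal fingerprints only suggest a match, so verify character by character.
            if (p == t && text.regionMatches(s, pattern, 0, m)) shifts.add(s);
            if (s < n - m) {
                // Roll the window: drop text[s], then append text[s + m].
                t = (t - text.charAt(s) * high % MOD + MOD) % MOD;
                t = (t * BASE + text.charAt(s + m)) % MOD;
            }
        }
        return shifts;
    }
}
```

Calling `RabinKarpSketch.search("abracadabra", "abra")` returns the shifts 0 and 7. Updating the fingerprint in constant time per shift is what carries information forward from one comparison to the next.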

Automaton Matcher Algorithm, 1974: This is the first linear-time algorithm based on deterministic automata. It scans the text character by character, from left to right, performing transitions on the automaton.

Classical/Dynamic Programming Method: As mentioned earlier, the classical method in exact string matching is based on character comparisons. The dynamic programming approach is also a classical solution; it computes the distance between strings.
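To make the automaton matcher described in this section concrete, the following Java sketch builds a transition table for the pattern and then scans the text once. It is only an illustration under stated assumptions: the class name `AutomatonMatcherSketch` is hypothetical, the alphabet is assumed to be the 256 extended-ASCII characters, and the table is built with a fallback state rather than by any construction described in this report.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative DFA matcher; assumes every character code is below 256. */
final class AutomatonMatcherSketch {
    private static final int ALPHABET = 256; // assumed alphabet size

    /** Builds next[state][c] in O(m * ALPHABET) time. */
    static int[][] buildTransitions(String pattern) {
        int m = pattern.length();
        int[][] next = new int[m + 1][ALPHABET];
        if (m == 0) return next;
        next[0][pattern.charAt(0)] = 1;
        int fallback = 0; // state the automaton falls back to after a mismatch
        for (int state = 1; state <= m; state++) {
            // On a mismatch, behave exactly like the fallback state.
            System.arraycopy(next[fallback], 0, next[state], 0, ALPHABET);
            if (state < m) {
                char c = pattern.charAt(state);
                next[state][c] = state + 1;   // on a match, advance
                fallback = next[fallback][c]; // keep the fallback state in step
            }
        }
        return next;
    }

    /** Returns every 0-based shift at which pattern occurs in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) return shifts;
        int[][] next = buildTransitions(pattern);
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = next[state][text.charAt(i)]; // one transition per text character
            if (state == m) shifts.add(i - m + 1);
        }
        return shifts;
    }
}
```

Once the table is built, the scan performs exactly one table lookup per text character, which is why the matching phase is linear in the length of the text.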

2.2 DIFFERENT STRING MATCHING ALGORITHMS

1. RABIN-KARP
2. KNUTH-MORRIS-PRATT
3. GALIL-SEIFERAS
4. BOYER-MOORE
5. BERRY-RAVINDRAN
6. SMITH
7. RAITA
8. HORSPOOL
9. BRUTE FORCE
10. SHIFT-OR
11. REVERSE COLUSSI
12. QUICK SEARCH
13. REVERSE FACTOR
14. OPTIMAL MISMATCH

2.2.1 HORSPOOL ALGORITHM

The bad-character shift used in the Boyer-Moore algorithm is not very efficient for small alphabets, but when the alphabet is large compared with the length of the pattern, as is often the case with the ASCII table and ordinary searches made in a text editor, it becomes very useful. Using it alone produces a very efficient algorithm in practice. Horspool proposed using only the bad-character shift of the rightmost character of the window to compute the shifts in the Boyer-Moore algorithm.
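As an illustration of this idea, a minimal Java sketch of a Horspool-style search is given below. The class name `HorspoolSketch` and the 256-character alphabet are assumptions made for the example; the window is verified with a plain substring comparison for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative Horspool matcher; assumes every character code is below 256. */
final class HorspoolSketch {
    private static final int ALPHABET = 256; // assumed alphabet size

    /** Returns every 0-based shift at which pattern occurs in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        // Bad-character table: distance from the last occurrence of each character
        // (excluding the final pattern position) to the end of the pattern.
        int[] shift = new int[ALPHABET];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) shift[pattern.charAt(i)] = m - 1 - i;

        int s = 0;
        while (s <= n - m) {
            if (text.regionMatches(s, pattern, 0, m)) shifts.add(s);
            // Shift by the table entry of the rightmost character of the window.
            s += shift[text.charAt(s + m - 1)];
        }
        return shifts;
    }
}
```

With a large alphabet most text characters do not occur in the pattern, so the window usually jumps by the full pattern length m.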

2.2.2 SMITH ALGORITHM

The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two strings or between nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The Smith–Waterman algorithm (SWA) is fairly demanding of time: to align two sequences of lengths m and n, O(mn) time is required. Smith–Waterman local similarity scores can be calculated in O(m) (linear) space if only the optimal alignment score is needed, but naive algorithms that produce the alignment itself require O(mn) space.

2.2.3 BERRY RAVINDRAN

Berry and Ravindran designed an algorithm which performs the shifts by

considering the bad-character shift for the two consecutive text characters immediately to

the right of the window.

2.2.4 REVERSE COLUSSI

The character comparisons are done using a specific order given by a table h. For each integer i such that 0 ≤ i ≤ m we define two disjoint sets:

Pos(i) = {k : 0 ≤ k ≤ i and x[i] = x[i-k]}

Neg(i) = {k : 0 ≤ k ≤ i and x[i] ≠ x[i-k]}

2.2.5 QUICK SEARCH ALGORITHM

The Quick Search algorithm uses only the bad-character shift table (see the Boyer-Moore

algorithm). After an attempt where the window is positioned on the text factor

y[j .. j+m-1], the length of the shift is at least equal to one. So, the character y[j+m] is

necessarily involved in the next attempt, and thus can be used for the bad-character shift of

the current attempt.

2.2.6 OPTIMAL MISMATCH

The preprocessing phase of the Optimal Mismatch algorithm consists of sorting the pattern characters in decreasing order of their frequencies and then building the Quick Search bad-character shift function (see the Quick Search algorithm) and a good-suffix shift function adapted to the scanning order of the pattern characters. It can be done in O(m² + σ) time and O(m + σ) space, where σ is the size of the alphabet.

2.2.7 RAITA ALGORITHM

Raita algorithm searches for a pattern "P" in a given text "T" by comparing each character

of pattern in the given text. Searching will be done as follows. Window for a text "T" is

defined as the length of "P".

1. First, last character of the pattern is compared with the rightmost character of the

window.

2. If there is a match, first character of the pattern is compared with the leftmost

character of the window.

3. If they match again, it compares the middle character of the pattern with middle

character of the window.
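The comparison order in the three steps above can be sketched in Java as follows. The report does not state what happens after step 3 or how the window is shifted, so this sketch makes two assumptions that are labeled in the code: after the three probes the whole window is compared with the pattern, and the window is moved with a Horspool-style bad-character shift. The class name `RaitaSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative Raita-style matcher; assumes every character code is below 256. */
final class RaitaSketch {
    private static final int ALPHABET = 256; // assumed alphabet size

    /** Returns every 0-based shift at which pattern occurs in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        // Assumption: Horspool-style bad-character shift table.
        int[] shift = new int[ALPHABET];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) shift[pattern.charAt(i)] = m - 1 - i;

        char first = pattern.charAt(0);
        char middle = pattern.charAt(m / 2);
        char last = pattern.charAt(m - 1);

        int s = 0;
        while (s <= n - m) {
            char rightmost = text.charAt(s + m - 1);
            if (rightmost == last                            // step 1: last character
                    && text.charAt(s) == first               // step 2: first character
                    && text.charAt(s + m / 2) == middle      // step 3: middle character
                    && text.regionMatches(s, pattern, 0, m)) { // assumption: full check
                shifts.add(s);
            }
            s += shift[rightmost];
        }
        return shifts;
    }
}
```

Probing the last, first and middle characters before anything else lets most non-matching windows be rejected after one or two comparisons.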

2.2.8 REVERSE FACTOR ALGORITHM

The Reverse Factor algorithm parses the characters of the window from right to left with the automaton S(x^R), where x^R denotes the reverse of the pattern x, starting with state q0. It continues until no transition is defined for the current character of the window from the current state of the automaton. At this point it is easy to determine the length of the longest prefix of the pattern that has been matched: it corresponds to the length of the path taken in S(x^R) from the start state q0 to the last final state encountered. Knowing the length of this longest prefix, it is trivial to compute the right shift to perform.

2.3 SOFTWARE DESCRIPTION

2.3.1 JAVA

Java is a set of computer software and specifications developed by Sun Microsystems,

which was later acquired by the Oracle Corporation, that provides a system for developing

application software and deploying it in a cross-platform computing environment. Java is

used in a wide variety of computing platforms from embedded devices and mobile phones

to enterprise servers and supercomputers. While they are less common than standalone Java

applications, Java applets run in secure, sandboxed environments to provide many features

of native applications and can be embedded in HTML pages.

Writing in the Java programming language is the primary way to produce code that

will be deployed as byte code in a Java Virtual Machine (JVM); byte code compilers are

also available for other languages, including Ada, JavaScript, Python, and Ruby. In

addition, several languages have been designed to run natively on the JVM, including

Scala, Clojure and Groovy. Java syntax borrows heavily from C and C++, but object-oriented

features are modeled after Smalltalk and Objective-C. Java eschews certain

low-level constructs such as pointers and has a very simple memory model where every

object is allocated on the heap and all variables of object types are references.

Memory management is handled through integrated automatic garbage

collection performed by the JVM.

On November 13, 2006, Sun Microsystems made the bulk of its implementation

of Java available under the GNU General Public License (GPL).

PLATFORM

The Java platform is a suite of programs that facilitate developing and running programs

written in the Java programming language. A Java platform will include an execution

engine (called a virtual machine), a compiler and a set of libraries; there may also be

additional servers and alternative libraries that depend on the requirements. Java is not

specific to any processor or operating system as Java platforms have been implemented for

a wide variety of hardware and operating systems with a view to enable Java programs to

run identically on all of them.

2.3.2 ECLIPSE PLATFORM ARCHITECTURE

Eclipse uses plug-ins to provide all the functionality within and on top of the runtime

system. Its runtime system is based on Equinox, an implementation of the OSGi core

framework specification.

In addition to allowing the Eclipse Platform to be extended using other programming

languages, such as C and Python, the plug-in framework allows the Eclipse Platform to

work with typesetting languages like LaTeX and networking applications such as telnet

and database management systems. The plug-in architecture supports writing any desired

extension to the environment, such as for configuration management. Java and CVS



support is provided in the Eclipse SDK, with support for other version control systems

provided by third-party plug-ins.

With the exception of a small run-time kernel, everything in Eclipse is a plug-in.

This means that every plug-in developed integrates with Eclipse in exactly the same way

as other plug-ins; in this respect, all features are "created equal". Eclipse provides plug-ins for a wide variety of features, some of which are from third parties using both free and commercial models. Examples include plug-ins for UML, for sequence and other UML diagrams, and for DB Explorer, among many others.

The Eclipse SDK includes the Eclipse Java development tools (JDT), offering an

IDE with a built-in incremental Java compiler and a full model of the Java source files. This

allows for advanced refactoring techniques and code analysis. The IDE also makes use of

a workspace, in this case a set of metadata over a flat filespace allowing external file

modifications as long as the corresponding workspace "resource" is refreshed afterwards.

Eclipse implements the graphical control elements of the Java toolkit called SWT, whereas

most Java applications use the Java standard Abstract Window Toolkit (AWT) or Swing.

Eclipse's user interface also uses an intermediate graphical user interface layer called JFace,

which simplifies the construction of applications based on SWT. Eclipse was made to run

on Wayland during a GSoC-Project in 2014.

Fig 2.2 Java Perspective.

CHAPTER 3

REQUIREMENT ANALYSIS

3.1 FUNCTIONAL REQUIREMENTS

The following are the parameters to be analyzed:

• Execution time

• Complexity

• Total number of comparisons

• Line of code

• Pre-processing time

3.1.1 EXECUTION TIME

The execution time or CPU time of a given task is defined as the time spent by the system

executing that task, including the time spent executing run-time or system services on its

behalf. The mechanism used to measure execution time is implementation defined. It is

implementation defined which task, if any, is charged the execution time that is consumed

by interrupt handlers and run-time services on behalf of the system.

The type CPU_Time represents the execution time of a task. The set of values of this type

corresponds one-to-one with an implementation-defined range of mathematical

integers.

CPU_Time_Start and CPU_Time_End are the smallest and largest values of the

CPU_Time type, respectively.
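As a rough illustration of how execution time can be measured in Java, the sketch below times a single run of a task with `System.nanoTime()`. Note the assumptions: this measures elapsed wall-clock time rather than the CPU time defined above, and the class name `ExecutionTimer` and method name `timeNanos` are hypothetical, not taken from the project code.

```java
/** Illustrative timer; measures elapsed wall-clock time, not CPU time. */
final class ExecutionTimer {
    /** Returns the elapsed time, in nanoseconds, of one run of the given task. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }
}
```

For example, `ExecutionTimer.timeNanos(() -> "abracadabra".indexOf("cad"))` returns the time taken by one library search. A single measurement is noisy on the JVM, so in practice one would typically repeat the run several times and compare averages.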


Algorithm              Execution time
NAÏVE                  O((n-m+1)m)
KNUTH MORRIS PRATT     O(m+n)
FINITE AUTOMATA        O(n) matching, after O(m * NO_OF_CHARS) pre-processing
RABIN KARP             O(m+n) on average, O((n-m+1)m) in the worst case

Table 3.1 EXECUTION TIME OF ALGORITHMS

3.1.2 COMPLEXITY:

Algorithmic complexity is concerned with how fast or slow a particular algorithm performs. We define complexity as a numerical function T(n): time versus the input size n. We want to define the time taken by an algorithm without depending on implementation details, yet T(n) does depend on the implementation. A given algorithm will take different amounts of time on the same inputs depending on such factors as processor speed, instruction set, disk speed and brand of compiler. The way around this is to estimate the efficiency of each algorithm asymptotically. We measure time T(n) as the number of elementary "steps" (defined in any way), provided each such step takes constant time.

Let us consider a classical example: addition of two integers. We add two integers digit by digit (or bit by bit), and this defines a "step" in our computational model. Therefore, we say that addition of two n-bit integers takes n steps. Consequently, the total computational time is T(n) = c * n, where c is the time taken by the addition of two bits. On different computers, the addition of two bits might take different times, say c1 and c2, so the addition of two n-bit integers takes T(n) = c1 * n and T(n) = c2 * n respectively. This shows that different machines result in different slopes, but the time T(n) grows linearly as the input size increases.

3.1.3 TOTAL NUMBER OF COMPARISONS:

It gives the count of the comparisons made between a text and a pattern. In other words,

the pattern is compared with each and every string or a line of text. This is done both

logically and physically.

3.1.4 LINE OF CODE:

Line of code (LOC) is a software metric used to measure the size of the computer program

by counting the number of lines in the text of the program’s source code. It is typically used

to predict the amount of effort that will be required to develop a program, as well as to

estimate programming productivity or maintainability once the software is produced.

There are two major types of LOC:

• Physical LOC


• Logical LOC

Specific definitions of these two measures vary, but the most common definition of physical

LOC is a count of lines in the text of the program’s source code excluding the comment

lines.

Logical LOC attempts to measure the number of executable "statements", but their

specific definitions are tied to specific computer languages. It is much easier to create tools

that measure physical LOC and physical LOC definitions are easier to explain.

However, physical LOC measures are sensitive to logically irrelevant formatting and style

conventions, while logical LOC is less sensitive to formatting and style conventions.

However, LOC measures are often stated without giving their definition, and logical LOC

can often be significantly different from physical LOC.
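A physical LOC count of the kind described above can be sketched in a few lines of Java. This is a simplified illustration, not the tool used in the project: the class name `LocCounterSketch` is hypothetical, blank lines are also excluded (an assumption), and only lines that begin with a comment marker are treated as comment lines.

```java
/** Illustrative physical-LOC counter for Java-like source text. */
final class LocCounterSketch {
    /** Counts lines that are neither blank nor comment-only (simplified rules). */
    static int physicalLoc(String source) {
        int count = 0;
        boolean inBlockComment = false;
        for (String raw : source.split("\\R")) { // \R matches any line break
            String line = raw.trim();
            if (inBlockComment) {
                // Simplification: anything after the closing marker is ignored.
                if (line.contains("*/")) inBlockComment = false;
                continue;
            }
            if (line.isEmpty() || line.startsWith("//")) continue;
            if (line.startsWith("/*")) {
                if (!line.contains("*/")) inBlockComment = true;
                continue;
            }
            count++; // a line with code, possibly followed by a trailing comment
        }
        return count;
    }
}
```

Because the count depends on how the source is laid out, reformatting the same program can change the result, which is the sensitivity to formatting noted above.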

3.1.5 PRE-PROCESSING TIME:

The following table shows the pre-processing time of different algorithms:

Algorithm              Pre-processing time
NAÏVE                  0
KNUTH MORRIS PRATT     O(m)
FINITE AUTOMATA        O(m * NO_OF_CHARS)
RABIN KARP             O(m)

Table 3.2 Pre-processing time

3.2 HARDWARE REQUIREMENTS

Processor : Any Processor above 500 MHz

RAM : 4 GB

Hard Disk : 1 TB

Input device : Standard Keyboard and Mouse

Output device : High Resolution Monitor

3.3 SOFTWARE REQUIREMENTS

• Operating system : Windows 10

• Front End : Eclipse 4.5 Mars in the Java EE perspective


• Platform : Java SE, Standard Widget Toolkit

• Type : Integrated Development Environment(IDE)

• Server : Internet Information Services

CHAPTER 4


DESIGN

4.1 DESIGN GOALS

4.1.1 NEED OF PATTERN MATCHING

Pattern matching is the process of checking a given string sequence for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact. The patterns generally have the form of sequences. Uses of pattern matching include outputting the locations of a pattern within a string sequence, outputting some component of the matched pattern, and substituting the matching pattern with some other string sequence (i.e., search and replace). The pattern matching concept is used in many applications; the following figure shows some of them.

Fig4.1 Applications of pattern matching

4.1.2 NOTATION USED BY THE ALGORITHM


NOTATION    USED FOR
N           Length of the text
M           Length of the pattern (string)
C           The size of the alphabet

Table 4.1 Algorithm notation.

4.2 ALGORITHM TECHNIQUES

Every algorithm uses a particular technique to find pattern matches. The following table shows the technique used by each algorithm.

ALGORITHM                                      TECHNIQUE
NAÏVE STRING MATCHING ALGORITHM                The pattern is compared, character by character,
                                               with each substring of the text that has the length
                                               of the pattern, until there is a mismatch or a match.
KNUTH-MORRIS PRATT STRING MATCHING ALGORITHM   Two indices l and r into text string t
FINITE AUTOMATA STRING MATCHING ALGORITHM      Subset construction
RABIN-KARP STRING MATCHING ALGORITHM           Hashing

Table 4.2 Algorithm techniques.



4.3 ALGORITHMS

4.3.1 NAÏVE STRING MATCHING ALGORITHM

NAIVE_STRING_MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m do
        if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift" s
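The naïve matcher above can be sketched in Java as follows. This is an illustrative version, not the project's own source: the class name NaiveMatcher and the method search are hypothetical, and the sketch uses 0-based string indices, so a reported shift s means the pattern starts at index s of the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive string matching: try every shift and compare character by character. */
final class NaiveMatcher {

    /** Returns every 0-based shift at which the pattern occurs in the text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        for (int s = 0; s + m <= n; s++) {          // s ranges over 0 .. n-m
            int k = 0;
            // Compare until the first mismatch or until the whole pattern matched.
            while (k < m && text.charAt(s + k) == pattern.charAt(k)) {
                k++;
            }
            if (k == m) {
                shifts.add(s);
            }
        }
        return shifts;
    }
}
```

There is no pre-processing step, and in the worst case the inner loop runs m times for each of the n-m+1 shifts.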

4.3.2 KNUTH-MORRIS PRATT STRING MATCHING

ALGORITHM

KMP_MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    ∏ ← KMP-PREFIX(P)
    i ← 0
    for j ← 1 to n do
        while i > 0 and P[i+1] ≠ T[j]
            do i ← ∏[i]
        if P[i+1] = T[j]
            then i ← i+1
        if i = m
            then print "Pattern occurs with shift" j-m
                 i ← ∏[i]

KMP-PREFIX(P)
    m ← length[P]
    ∏[1] ← 0
    i ← 0
    for j ← 2 to m do
        while i > 0 and P[i+1] ≠ P[j]
            do i ← ∏[i]
        if P[i+1] = P[j]
            then i ← i+1
        ∏[j] ← i
    return ∏
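A possible Java rendering of the two KMP procedures is sketched below. The class and method names (KmpMatcher, prefixFunction, search) are hypothetical. The sketch uses 0-based arrays, so pi[j] here corresponds to ∏[j+1] in the pseudocode, and it returns no shifts for an empty pattern, which is a choice made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Knuth-Morris-Pratt matching with a 0-based prefix function. */
final class KmpMatcher {

    /** pi[j] = length of the longest proper prefix of p[0..j] that is also its suffix. */
    static int[] prefixFunction(String p) {
        int m = p.length();
        int[] pi = new int[m];
        int k = 0;                                   // length of the current matched prefix
        for (int j = 1; j < m; j++) {
            while (k > 0 && p.charAt(k) != p.charAt(j)) {
                k = pi[k - 1];                       // fall back to a shorter border
            }
            if (p.charAt(k) == p.charAt(j)) {
                k++;
            }
            pi[j] = k;
        }
        return pi;
    }

    /** Returns every 0-based shift at which the pattern occurs in the text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) {
            return shifts;                           // assumption: empty pattern yields no shifts
        }
        int[] pi = prefixFunction(pattern);
        int k = 0;                                   // number of pattern characters matched
        for (int j = 0; j < text.length(); j++) {    // the text index j never moves backwards
            while (k > 0 && pattern.charAt(k) != text.charAt(j)) {
                k = pi[k - 1];
            }
            if (pattern.charAt(k) == text.charAt(j)) {
                k++;
            }
            if (k == m) {
                shifts.add(j - m + 1);
                k = pi[k - 1];                       // continue, allowing overlapping matches
            }
        }
        return shifts;
    }
}
```

For the pattern "ababaca" the prefix function is 0, 0, 1, 2, 3, 0, 1. On a mismatch only the pattern position k is reset through this table, so each text character is read once.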



4.3.3 FINITE AUTOMATA STRING MATCHING ALGORITHM

DFAPatternSearch(pattern, text)
    nextState = computeNextStateFunction(pattern)
    currentState = 0
    for textCursor = 0 to n-1
        character = text[textCursor]
        currentState = nextState[currentState][character]
        if currentState = m
            return (textCursor - m + 1)
        endif
    endfor
    return -1

computeNextStateFunction(pattern)
    for state = 0 to m-1
        for character = smallestChar to largestChar
            prefixPlusChar = concatenate(first state characters of pattern, character)
            k = length of longest suffix of prefixPlusChar that is a prefix of pattern
            nextState[state][character] = k
        endfor
    endfor
    return nextState
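The automaton-based search can be sketched in Java as shown below. The names AutomatonMatcher, buildTransitions and search are hypothetical, and the alphabet is assumed to be the 256 character codes 0 to 255 (an assumption of this sketch; the pattern must only contain such characters). Like the pseudocode, the search returns the position of the first match only, and the transition table is computed directly from its definition rather than with a faster construction.

```java
/** String matching with a deterministic finite automaton built from the pattern. */
final class AutomatonMatcher {

    /** Assumption: every pattern character has a code below 256. */
    static final int ALPHABET = 256;

    /** next[state][c] = length of the longest prefix of pattern that is a suffix of pattern[0..state) + c. */
    static int[][] buildTransitions(String pattern) {
        int m = pattern.length();
        int[][] next = new int[m + 1][ALPHABET];
        for (int state = 0; state <= m; state++) {
            for (int c = 0; c < ALPHABET; c++) {
                String seen = pattern.substring(0, state) + (char) c;
                int k = Math.min(m, state + 1);       // the new state cannot exceed state + 1
                while (k > 0 && !seen.endsWith(pattern.substring(0, k))) {
                    k--;
                }
                next[state][c] = k;
            }
        }
        return next;
    }

    /** Returns the 0-based index of the first occurrence of pattern in text, or -1. */
    static int search(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) {
            return 0;                                 // the empty pattern matches at index 0
        }
        int[][] next = buildTransitions(pattern);
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A character outside the assumed alphabet cannot occur in the pattern.
            state = c < ALPHABET ? next[state][c] : 0;
            if (state == m) {
                return i - m + 1;
            }
        }
        return -1;
    }
}
```

All of the cost of this direct construction is paid before the search starts. The search loop itself performs one table lookup per text character.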

4.3.4 RABIN KARP STRING MATCHING ALGORITHM

RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^(m-1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m
        do p ← (d*p + P[i]) mod q
           t_0 ← (d*t_0 + T[i]) mod q
    for s ← 0 to n-m
        do if p = t_s
               then if P[1..m] = T[s+1..s+m]
                        then print "Pattern occurs with shift" s
           if s < n-m
               then t_(s+1) ← (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
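The rolling hash used by Rabin-Karp can be sketched in Java as follows. The class name RabinKarpMatcher is hypothetical, indices are 0-based, and character codes are used directly as digit values (an assumption of this sketch). The parameters d and q have the same meaning as in the pseudocode: d is the radix and q is the modulus.

```java
import java.util.ArrayList;
import java.util.List;

/** Rabin-Karp matching with a rolling hash modulo q. */
final class RabinKarpMatcher {

    /** Returns every 0-based shift at which the pattern occurs in the text. */
    static List<Integer> search(String text, String pattern, int d, int q) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return shifts;
        }
        long h = 1;                                   // h = d^(m-1) mod q
        for (int i = 1; i < m; i++) {
            h = (h * d) % q;
        }
        long p = 0;                                   // hash of the pattern
        long t = 0;                                   // hash of the current text window
        for (int i = 0; i < m; i++) {
            p = (d * p + pattern.charAt(i)) % q;
            t = (d * t + text.charAt(i)) % q;
        }
        for (int s = 0; s <= n - m; s++) {
            // Equal hashes may be a spurious hit, so the characters are verified.
            if (p == t && text.regionMatches(s, pattern, 0, m)) {
                shifts.add(s);
            }
            if (s < n - m) {
                // Drop the leading character, shift by d, and append the next character.
                t = (d * (t - text.charAt(s) * h) + text.charAt(s + m)) % q;
                if (t < 0) {
                    t += q;                           // Java's % can return a negative value
                }
            }
        }
        return shifts;
    }
}
```

With a small modulus such as q = 13 many windows share the pattern's hash value, which is why the explicit character comparison is kept after the hash test.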

4.4 APACHE POI

Apache POI helps in creating a spreadsheet and manipulating it using Java. A spreadsheet is a page in an Excel file; it contains rows and columns with specific names.

Operations performed are:

1) Create a blank workbook

2) Create a blank spreadsheet

3) Create a row

4) Insert cell values

5) Write the workbook to the output file and close it

4.5 GRAPH SNIPPET

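import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// extime (the measured run time) and filename are set elsewhere in the program.
HSSFWorkbook workbook = new HSSFWorkbook();
// A new workbook has no sheets yet, so the sheet is created rather than fetched.
HSSFSheet sheet = workbook.createSheet("FirstSheet");
HSSFRow rowhead = sheet.createRow(0);                 // header row
rowhead.createCell(0).setCellValue("run time");
HSSFRow row = sheet.createRow(1);                     // data row
row.createCell(0).setCellValue(extime);
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();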

CHAPTER 5



IMPLEMENTATION

5.1 DATA SET

A data set is a collection of related, discrete items of data that may be accessed individually, accessed in combination, or managed as a whole entity. A data set can be a database or a plain text file. In this project we use a text file as the data set.

Functions Used To Import Data from a Text File:

The BufferedReader class reads text from a character-input stream, buffering characters so as to provide for efficient reading of characters, arrays, and lines. The buffer size may be specified, or the default size may be used. The default is large enough for most purposes. Here is how it looks:

BufferedReader bufferedReader = new BufferedReader(new FileReader("my file"));
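As an illustration of how such a reader might be used to load the data set, the sketch below reads a whole text file into one string. The class name TextLoader and the method readAll are hypothetical, and the file path is supplied by the caller.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Loads a text file line by line through a BufferedReader. */
final class TextLoader {

    /** Returns the file content with every line terminated by '\n'. */
    static String readAll(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {   // null marks the end of the file
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }
}
```

The returned string can then be passed as the text to any of the matching algorithms.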

Scanner is a class which allows the user to read values of various types. It is a simple text scanner which can parse primitive types and strings using regular expressions. The scanner looks for tokens in the input: it breaks its input into tokens using a delimiter pattern, which by default matches whitespace, so a token is a series of characters that ends with what Java calls whitespace. The Scanner class belongs to the java.util package.

There are two constructors that are particularly useful: one takes an InputStream object as a parameter and the other takes a FileReader object as a parameter.

Scanner in = new Scanner(System.in); // System.in is the standard input stream

Scanner inFile = new Scanner(new FileReader("myFile"));
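The default tokenizing behaviour described above can be seen in a short sketch. The class name Tokenizer is hypothetical, and the scanner reads from a String instead of a file so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Splits input into whitespace-delimited tokens with java.util.Scanner. */
final class Tokenizer {

    /** Returns the tokens of the input in order, using the default delimiter. */
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(input)) {
            while (in.hasNext()) {          // true while another token is available
                result.add(in.next());
            }
        }
        return result;
    }
}
```

Runs of spaces and line breaks are treated as a single delimiter, and punctuation stays attached to its word, so "seventeen," is one token.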

Fig 5.1 Block diagram to show reader and writer function

5.1.1 INPUT TEXT FILE

A Walk to Remember

NICHOLAS SPARKS



Prologue When I was seventeen, my life changed forever. I know that there are people

who wonder about me when I say this. They look at me strangely as if trying to

fathom what could have happened back then, though I seldom bother to explain.

Because I've lived here for most of my life, I don't feel that I have to unless it's on my

terms, and that would take more time than most people are willing to give me. My

story can't be summed up in two or three sentences; it can't be packaged into

something neat and simple that people would immediately understand. Despite the

passage of forty years, the people still living here who knew me that year accept my

lack of explanation without question. My story in some ways is their story because it

was something that all of us lived through. It was I, however, who was closest to it.

I'm fifty-seven years old, but even now I can remember everything from that year,

down to the smallest details. I relive that year often in my mind, bringing it back to

life, and I realize that when I do, I always feel a strange combination of sadness and

joy. There are moments when I wish I could roll back the clock and take all the

sadness away, but I have the feeling that if I did, the joy would be gone as well. So I

take the memories as they come, accepting them all, letting them guide me whenever I

can. This happens more often than I let on. It is April 12, in the last year before the

millennium, and as I leave my house, I glance around. The sky is overcast and gray,

but as I move down the street, I notice that the dogwoods and azaleas are blooming.

Old Hegbert, he'd stop dead in his tracks and his ears would perk up-I swear to God,

they actually moved-and he'd turn this bright shade of red, like he'd just drunk

gasoline, and the big green veins in his neck would start sticking out all over, like



those maps of the Amazon River that you see in National Geographic. He'd peer from

side to side, his eyes narrowing into slits as he searched .

SAMPLE INPUT TEXT FILE

5.2 GRAPHICAL USER INTERFACE:

We have used JComboBox for user interaction. The class JComboBox is a component which combines a button or editable field and a drop-down list.

Following is the declaration for the javax.swing.JComboBox class:

public class JComboBox<E> extends JComponent
    implements ItemSelectable, ListDataListener,
               ActionListener, Accessible
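A combo box for choosing the algorithm might be set up as in the sketch below. This is not the project's actual GUI code: the class name AlgorithmChooser, its methods, and the item labels are assumptions based on the four algorithms studied in this report.

```java
import javax.swing.DefaultComboBoxModel;
import javax.swing.JComboBox;

/** Builds a drop-down list for selecting a string matching algorithm. */
final class AlgorithmChooser {

    /** Hypothetical labels for the four algorithms compared in the report. */
    static final String[] ALGORITHMS = {
        "Naive", "Knuth-Morris-Pratt", "Finite Automata", "Rabin-Karp"
    };

    /** The data model behind the combo box; the first item is selected initially. */
    static DefaultComboBoxModel<String> model() {
        return new DefaultComboBoxModel<>(ALGORITHMS);
    }

    /** Creates the combo box and prints the selection whenever it changes. */
    static JComboBox<String> build() {
        JComboBox<String> box = new JComboBox<>(model());
        box.addActionListener(e ->
            System.out.println("Selected: " + box.getSelectedItem()));
        return box;
    }
}
```

The component returned by build() would be added to a frame or panel, and the action listener is the place where the chosen algorithm would be run.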


Fig 5.2 Block diagram to show a view of JComboBox

Analysis:

We have graphically analyzed the results of the project using an Excel sheet.

CHAPTER 6



TESTING

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.

TYPES OF TESTS

6.1 UNIT TESTING

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application, and it is done after the completion of an individual unit, before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
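For a string matcher, a unit test of this kind could look like the following sketch. It is only an illustration: the class MatcherUnitTest and the unit under test, firstShift, are hypothetical, and the expected results are taken from Java's own String.indexOf, which serves as the reference.

```java
/** A small hand-written unit test for a first-occurrence string matcher. */
final class MatcherUnitTest {

    /** Unit under test (hypothetical): 0-based index of the first occurrence, or -1. */
    static int firstShift(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        for (int s = 0; s + m <= n; s++) {
            if (text.regionMatches(s, pattern, 0, m)) {
                return s;
            }
        }
        return -1;
    }

    /** Each case is {text, pattern}; returns the number of cases that fail. */
    static int countFailures(String[][] cases) {
        int failures = 0;
        for (String[] c : cases) {
            int expected = c[0].indexOf(c[1]);        // reference result
            int actual = firstShift(c[0], c[1]);
            if (expected != actual) {
                System.out.println("FAIL: text=\"" + c[0] + "\" pattern=\"" + c[1]
                    + "\" expected " + expected + " but got " + actual);
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"abababa", "aba"},       // match at the start
            {"hello world", "world"}, // match at the end
            {"abc", "abcd"},          // pattern longer than the text
            {"aaaa", "b"},            // no match
            {"", ""}                  // empty inputs
        };
        System.out.println(countFailures(cases) + " failure(s)");
    }
}
```

The cases are chosen to exercise the decision branches of the unit: a match, no match, and the boundary conditions at both ends of the text.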



6.2 INTEGRATION TESTING

Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is also correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.

6.3 VALIDATION TESTING

An engineering validation test (EVT) is performed on first engineering prototypes to ensure that the basic unit performs to design goals and specifications. Identifying design problems and solving them as early in the design cycle as possible is the key to keeping projects on time and within budget. Too often, product design and performance problems are not detected until late in the product development cycle, when the product is ready to be shipped. The old adage holds true: it costs a penny to make a change in engineering, a dime in production and a dollar after a product is in the field.

Verification is a quality control process that is used to evaluate whether or not a product, service, or system complies with regulations, specifications, or conditions imposed at the start of a development phase. Verification can take place in development, scale-up, or production. This is often an internal process.

Validation is a quality assurance process of establishing evidence that provides a high degree of assurance that a product, service, or system accomplishes its intended requirements. This often involves acceptance of fitness for purpose with end users and other product stakeholders.

The testing process overview is as follows:

Fig 6.1 The testing process

6.4 SYSTEM TESTING

System testing of software or hardware is testing conducted on a complete,

integrated system to evaluate the system's compliance with its specified requirements.

System testing falls within the scope of black box testing, and as such, should require no

knowledge of the inner design of the code or logic.



As a rule, system testing takes, as its input, all of the "integrated" software

components that have successfully passed integration testing and also the software system

itself integrated with any applicable hardware system(s).

System testing is a more limited type of testing; it seeks to detect defects both

within the "inter-assemblages" and also within the system as a whole.

System testing is performed on the entire system in the context of a Functional

Requirement Specification(s) (FRS) and/or a System Requirement Specification (SRS).

System testing tests not only the design, but also the behavior and even the believed

expectations of the customer. It is also intended to test up to and beyond the bounds defined

in the software/hardware requirements specification(s).

6.5 TESTING OF INITIALIZATION AND GUI COMPONENTS

GUIs are tested manually, often by the developers themselves. This is very unreliable and

expensive. For new GUIs or those being significantly changed, quality is low, and failures

at integration time or during user acceptance tests are common.

Screen Scraper/replay based GUI test techniques are adopted. In a few months, someone

notices that these don't work, though they are beloved by managers looking for cheap

solutions. The problem is that every time you change the screen layout all existing tests

become useless, which means you have no regression tests.

http://c2.com/cgi/wiki?ScreenScraper






Serial Number of Test Case : TC 01

Module Under Test : DATASET

Description : When the program is executed, any word or phrase of the user's choice is searched for in a text file (data set), along with analysis of time complexity, preprocessing time, lines of code, number of comparisons and execution time.

Output : If the pattern is present in the data set, "match found" is reported along with the analyzed parameters; otherwise "match not found" is reported.

Remarks : Test successful.

Table 6.1: Test case when input is taken from a data set (text file)



Serial Number of Test Case : TC 02

Module Under Test : USER INPUT

Description : When the program is executed, the user can give the input as text, and the given pattern is searched for in that text, along with analysis of time complexity, preprocessing time, lines of code, number of comparisons and execution time.

Output : If the pattern is present in the text, "match found" is reported along with the analyzed parameters; otherwise "match not found" is reported.

Remarks : Test successful.

Table 6.2: Test case when input is given by the user



CHAPTER 7

SNAPSHOTS

7.1 OUTPUT OF NAÏVE ALGORITHM



Fig 7.1 Output of Naïve algorithm.

7.2 OUTPUT OF KNUTH-MORRIS PRATT ALGORITHM

Fig 7.2 Output of Knuth Morris Pratt algorithm.



7.3 OUTPUT OF FINITE AUTOMATA ALGORITHM

Fig 7.3 Output of Finite automata algorithm.



7.4 OUTPUT OF RABIN-KARP ALGORITHM

Fig 7.4 Output of Rabin Karp algorithm.



7.5 COMPARISON OF ALGORITHMS OUTPUT

Fig 7.5.1 Output of algorithm comparison parameters.



Fig 7.5.2 Output of algorithm comparison parameters.



CHAPTER 8

CONCLUSION AND FUTURE ENHANCEMENT

8.1 CONCLUSION

String matching is the problem of finding all occurrences of a character pattern in a text. This project provides an overview of different string matching algorithms and a comparative study of these algorithms. In this project, we have evaluated the Naïve string matching algorithm, the Knuth-Morris-Pratt string matching algorithm, the Rabin-Karp string matching algorithm and the Finite automata string matching algorithm.

String matching algorithms play a vital role in computational biology. The functional and structural relationships of a biological sequence are determined by similarities within that sequence, so the researcher needs to be aware of the similarities among biological sequences. The search for similarity among biological sequences is an important research area that can bring insight into the evolutionary and genetic relationships among genes.

In this project, we have studied different kinds of string matching algorithms and observed their time complexity, number of comparisons, lines of code, execution time and preprocessing time.

From this study, we find that the KMP algorithm is relatively easy to implement because it never needs to move backwards in the input sequence. The Rabin-Karp algorithm is used to detect plagiarism. The Naïve algorithm does not require preprocessing of the text or the pattern; the problem is that it is very slow and rarely produces an efficient result.

The string-matching automaton is very efficient: it examines each character in the text exactly once and reports all the valid shifts. The Finite automata string matching algorithm is also faster compared to the other algorithms.

Innovation and creativity in string matching can play an immense role in achieving time-efficient performance in various domains of computer science.

8.2 FUTURE ENHANCEMENT

1. Space complexity can be determined.

2. The algorithms can be applied to inputs larger than the current maximum size.

3. The program can be connected to the web, using the Internet as a data set.

4. The graphical user interface can be made more user friendly.



