VISVESVARAYA TECHNOLOGICAL UNIVERSITY
“Jnana Sangama”, Belagavi – 590 018
A PROJECT REPORT ON
“ANALYSIS OF DIFFERENT STRING MATCHING
ALGORITHMS”
Submitted in partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BY
BHUPALAM SNEHA (1NH12CS714)
DIVYA PEDDI REDDY (1NH12CS721)
NIHARIKA K (1NH12CS736)
SRUTHI VEGI (1NH12CS757)
Under the guidance of Ms. DEEPIKA.N
(Senior Assistant Professor, Dept. of CSE, NHCE)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
NEW HORIZON COLLEGE OF ENGINEERING
(ISO-9001:2000 certified, Accredited by NAAC ‘A’,
Permanently affiliated to VTU)
Outer Ring Road, Panathur Post, Near Marathalli,
Bangalore – 560103
NEW HORIZON COLLEGE OF ENGINEERING
(ISO-9001:2000 certified, Accredited by NAAC ‘A’,
Permanently affiliated to VTU)
Outer Ring Road, Panathur Post, Near Marathalli,
Bangalore-560 103
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
Certified that the project work entitled “ANALYSIS OF DIFFERENT STRING
MATCHING ALGORITHMS” carried out by BHUPALAM SNEHA (1NH12CS714),
DIVYA PEDDI REDDY (1NH12CS721), NIHARIKA K (1NH12CS736) and SRUTHI
VEGI (1NH12CS757) bonafide students of NEW HORIZON COLLEGE OF
ENGINEERING in partial fulfillment for the award of Bachelor Of Engineering in
Computer Science and Engineering of the Visvesvaraya Technological University,
Belgaum during the year 2015-2016. It is certified that all corrections/suggestions indicated
for Internal Assessment have been incorporated in the report deposited in the department
library. The project report has been approved as it satisfies the academic requirements in
respect of Project work prescribed for the said Degree.
Name & Signature of Guide Name Signature of HOD Signature of Principal
(Ms.DEEPIKA.N) (Dr. Prashanth C.S.R.) (Dr. Manjunatha)
External Viva
Name of Examiner Signature with date
1.
2.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without the mention of the people who made it possible, whose
constant guidance and encouragement crowned our efforts with success.
We thank the management, Dr. Mohan Manghnani, Chairman of NEW HORIZON
EDUCATIONAL INSTITUTIONS, for providing the necessary infrastructure and creating a good
environment.
We also record here the constant encouragement and facilities extended to us by Dr.
Manjunatha, Principal, NHCE and Dr. Prashanth.C.S.R, Dean Academics, Head of the
Department of Computer Science and Engineering. We extend our sincere gratitude to
them.
We express our gratitude to Ms. DEEPIKA.N, our project guide, for constantly
monitoring the development of the project and setting up precise deadlines. Her valuable
suggestions were the motivating factors in completing the work.
We would also like to express our gratitude to NHCE and to all our external guides
at NHCE for their continuous guidance and motivation.
Finally a note of thanks to the teaching and non-teaching staff of Computer Science
and Engineering Department for their cooperation extended to us and our friends, who
helped us directly or indirectly in the course of the project work.
BHUPALAM SNEHA (1NH12CS714)
DIVYA PEDDI REDDY (1NH12CS721)
NIHARIKA K (1NH12CS736)
SRUTHI VEGI (1NH12CS757)
ABSTRACT
String matching is the problem of finding all occurrences of a character pattern in
a text. In this project, we have analyzed four algorithms: the Naive string matching
algorithm, Rabin-Karp, Knuth-Morris-Pratt and Finite Automata. We analyzed the core ideas
of these single-pattern and multi-pattern string matching
algorithms. We compared the matching efficiencies of these algorithms by searching speed,
pre-processing time, matching time and the key ideas used in each algorithm.
The applicability of the various string matching algorithms is also described, identifying
the most suitable algorithm for activities in which string matching is an
important aspect of functionality. In all of these applications, a text string must be
matched against a pattern.
CONTENTS
1. INTRODUCTION
1.1 ABSTRACT
1.2 PROBLEM DEFINITION
1.3 PROJECT PURPOSE
1.4 PROJECT FEATURES
2. LITERATURE SURVEY
2.1 STRING MATCHING
2.2 DIFFERENT STRING MATCHING ALGORITHMS
2.3 SOFTWARE DESCRIPTION
3. REQUIREMENT ANALYSIS
3.1 FUNCTIONAL REQUIREMENTS
3.2 HARDWARE REQUIREMENTS
3.3 SOFTWARE REQUIREMENTS
4. DESIGN
4.1 DESIGN GOALS
4.2 ALGORITHM TECHNIQUES
4.3 ALGORITHMS
4.4 GRAPH SNIPPET
5. IMPLEMENTATION
5.1 DATASET
5.2 GRAPHICAL USER INTERFACE
6. TESTING
6.1 UNIT TESTING
6.2 INTEGRATION TESTING
6.3 VALIDATION TESTING
6.4 SYSTEM TESTING
6.5 TESTING OF INITIALIZATION AND UI COMPONENTS
7. SNAPSHOT
7.1 OUTPUT OF NAÏVE ALGORITHM
7.2 OUTPUT OF KNUTH-MORRIS PRATT ALGORITHM
7.3 OUTPUT OF FINITE AUTOMATA ALGORITHM
7.4 OUTPUT OF RABIN KARP ALGORITHM
7.5 OUTPUT OF ALGORITHMS COMPARISON PARAMETERS
8. CONCLUSION AND FUTURE ENHANCEMENT
8.1 CONCLUSION
8.2 FUTURE ENHANCEMENT
REFERENCES
LIST OF FIGURES
Fig 2.1 EXAMPLE FOR STRING MATCHING ALGORITHM
Fig 2.2 JAVA PERSPECTIVE
Fig 4.1 APPLICATION OF PATTERN MATCHING
Fig 5.1 BLOCK DIAGRAM TO SHOW READER AND WRITER FUNCTION
Fig 5.2 BLOCK DIAGRAM TO SHOW JCOMBO BOX
Fig 6.1 THE TESTING PROCESS
LIST OF TABLES
Table 3.1 EXECUTION TIME OF ALGORITHMS
Table 3.2 PRE-PROCESSING TIME
Table 4.1 ALGORITHM NOTATION
Table 4.2 ALGORITHM TECHNIQUES
Table 6.1 TEST CASE WHEN INPUT IS TAKEN FROM DATA SET
Table 6.2 TEST CASE WHEN INPUT GIVEN BY USER
Analysis of Different String Matching Algorithms
Dept of CSE,NHCE 1
CHAPTER 1
INTRODUCTION
1.1 ABSTRACT
A string matching algorithm aims to find one or several occurrences of a string within
another. The algorithm returns the position of the first character of the desired substring in
the text. There are many different solutions to this problem; this project presents four of the
best-known string matching algorithms: Naive, Knuth-Morris-Pratt, Finite Automata and
Rabin-Karp. The results show that Finite Automata is the most effective algorithm for solving
the string matching problem in usual cases, and that Rabin-Karp is a good alternative in some
specific cases, for example when the pattern and the alphabet are very small.
1.2 PROBLEM DEFINITION
The string matching problem can be formulated as follows: the pattern to be searched for
is an array P[1..m] of length m, and the text (document) is an array T[1..n] of length n, with
m ≤ n. The elements of P and T are characters belonging to a finite alphabet Σ (for example,
Σ = {a, b, ..., z}). The problem is
to find all shifts s ∈ [0, n-m] such that T[s + i] = P[i] for all i ∈ [1, m].
1.3 PROJECT PURPOSE
The purpose of this project is to study different algorithms for the string matching problem.
These algorithms are used to find one, several or all occurrences of a defined
string (pattern) in a larger string (typically a text). The string matching problem has
many applications in multiple areas. First, an efficient algorithm for
this problem can improve the responsiveness of text-editing software. Other
applications in information technology include web search engines, spam filters, natural
language processing, computational biology (searching for a particular pattern in a DNA
sequence) and feature detection in digital images. There are different solutions to
the string matching problem. First, we have the naive algorithm, the simplest one,
which tries to match the pattern against each substring of the same length in the text. Since the
1970s, several other algorithms, more sophisticated and more effective, have been
invented. In 1977, Knuth, Morris and Pratt published the first algorithm that preprocesses
the pattern to obtain better performance: the Knuth-Morris-Pratt algorithm. In 1987,
Rabin and Karp proposed an algorithm based on a completely different approach:
the Rabin-Karp algorithm, which computes a hash value for the pattern and then looks for a
match by applying the same hash function to each possible substring of the same length in the
text. In the theory of computation, a branch of theoretical computer science, a deterministic
finite automaton (DFA), also known as a deterministic finite acceptor or
deterministic finite state machine, is a finite state machine that accepts or rejects finite
strings of symbols and produces a unique computation (or run) of the automaton for
each input string. 'Deterministic' refers to the uniqueness of the computation. In search of
the simplest models to capture finite state machines, McCulloch and Pitts were among the
first researchers to introduce a concept similar to the finite automaton, in 1943. In this project,
we present the four algorithms mentioned above. The final goal is the comparison of these
algorithms. To achieve this, we implement, test, and compare the complexity, execution
time, pre-processing time, lines of code and number of comparisons of each algorithm. The
comparison is carried out in different situations: with small and large alphabets, and with
patterns that appear zero times, once, a few times or many times in the text, depending on their length.
We observe and analyze the effectiveness of the algorithms by measuring their execution
times under these different conditions. We start with the simplest solution, the naive algorithm.
Then we show three more sophisticated and more efficient solutions. After that, we show
and compare the results obtained for each algorithm in the different cases considered; we
observe that the Finite Automata algorithm is the best solution.
1.4 PROJECT FEATURES
Our project was developed to analyze pattern matching algorithms. In pattern
matching there are two strings: the text T[1..n], which is the main string, and the
pattern P[1..m], which is the string to be matched against the main string,
with m ≤ n. We have chosen four different algorithms and analyzed them on a few
parameters: execution time, number of comparisons, complexity, lines of code and
pre-processing time. These parameters are used to identify the best and most efficient
algorithm. We have also tried string matching with words, lines, paragraphs, etc., using a
large data set. Hence our project works not just for small texts but also for relatively big
data. From the analysis in our project, we have found that the finite automata
string matching algorithm works the fastest and is the most efficient.
CHAPTER 2
LITERATURE SURVEY
2.1 STRING MATCHING
In order to survey string matching algorithms, we consider the means used to
answer two types of search models: (a) the query is a word (which depends on the language), or
(b) the query is any sequence starting at an index point. The answer models corresponding to
these search models are exact match and approximate match, respectively. In the remainder of
this section we review the recently updated and hybrid algorithms. Exact string matching
algorithms deal with finding all occurrences of pattern P in text T. We classify exact string matching
approaches based on their character comparison methods, and differentiate between
classical, deterministic finite automata, bit-parallelism and hashing string matching
algorithms.
Classical Method: Classical string searching algorithms are based on character
comparisons.
Fig 2.1 Example for string matching.
Naive Algorithm: This algorithm could be considered the simplest string matching
algorithm, since it performs character comparisons between the scanned text substring and
the complete pattern from left to right. In the case of a mismatch or a complete match it
shifts exactly one position to the right. It requires no preprocessing phase and no extra
space.
Knuth-Morris-Pratt (KMP) Algorithm, 1977: This algorithm searches for occurrences of a
pattern P within a main text T from left to right. When a mismatch occurs, it uses the
characters already matched to determine the largest shift of the pattern that cannot skip an
occurrence, thus avoiding redundant comparisons.
Rabin-Karp Algorithm, 1987: R. Karp and M. Rabin published the randomized
fingerprint method as a practical and efficient solution to the string-matching problem
(Karp & Rabin, 1987). The randomized fingerprint method is a good fit for our
purposes because it carries information forward from one comparison to the next, it
performs well in practice, and it can be generalized to other related problems.
The Rabin-Karp algorithm uses modular arithmetic and Horner’s rule
to calculate a fingerprint (a number) for each substring of a
larger text T. The algorithm first calculates the fingerprint of pattern P (denoted p).
Then it iterates through every location of the text T. At each iteration we are at
a location (offset, position or shift) denoted s, and the algorithm calculates the fingerprint of
the pattern-length substring beginning at s. If the substring’s fingerprint is not equal to p, the
substring definitely does not match the pattern, so most non-matching shifts are rejected
without comparing characters. If the fingerprints are equal, the substring is compared with the
pattern character by character, since two different strings can share a fingerprint. Because
each fingerprint is computed from the previous one in constant time, the comparison process
is also sped up.
Automaton Matcher Algorithm, 1974: This is the first linear-time algorithm based on
deterministic automata. It scans the text character by character, from left to right,
performing transitions on the automaton.
Classical/Dynamic Programming Method: As mentioned earlier, the classical method in exact
string matching is based on character comparisons. The dynamic programming approach is also
a classical solution; it computes the distance between strings.
2.2 DIFFERENT STRING MATCHING ALGORITHMS
1. RABIN KARP
2. KNUTH MORRIS PRATT
3. GALIL SEIFERAS
4. BOYER MOORE
5. BERRY RAVINDRAN
6. SMITH
7. RAITA
8. HORSPOOL
9. BRUTE FORCE
10. SHIFT OR
11. REVERSE COLUSSI
12. QUICK SEARCH
13. REVERSE FACTOR
14. OPTIMAL MISMATCH
2.2.1 HORSPOOL ALGORITHM
The bad-character shift used in the Boyer-Moore algorithm is not very efficient for
small alphabets, but when the alphabet is large compared with the length of the
pattern, as it is often the case with the ASCII table and ordinary searches made under
a text editor, it becomes very useful.
Using it alone produces a very efficient algorithm in practice. Horspool proposed to
use only the bad-character shift of the rightmost character of the window to compute
the shifts in the Boyer-Moore algorithm.
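The idea described above can be sketched in Java as follows. This is a minimal illustration rather than code from the project: the class name HorspoolSketch and the choice of a table covering every 16-bit char value are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HorspoolSketch {
    // Returns every 0-based shift at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        // Bad-character table: for each character, the distance from its last
        // occurrence in pattern[0..m-2] to the end of the pattern (m if absent).
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i)] = m - 1 - i;
        }

        int s = 0;
        while (s <= n - m) {
            // Compare the window with the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(s + j) == pattern.charAt(j)) j--;
            if (j < 0) shifts.add(s);
            // Shift according to the rightmost character of the window only.
            s += shift[text.charAt(s + m - 1)];
        }
        return shifts;
    }
}
```

For example, HorspoolSketch.search("abracadabra", "abra") reports the shifts 0 and 7.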
2.2.2 SMITH ALGORITHM
The Smith–Waterman algorithm performs local sequence alignment; that is, it
determines similar regions between two strings or nucleotide or protein sequences.
Instead of looking at the total sequence, the Smith–Waterman algorithm compares
segments of all possible lengths and optimizes the similarity measure. The Smith–
Waterman algorithm (SWA) is fairly demanding of time: to align two sequences of
lengths m and n, O(mn) time is required. Smith–Waterman local similarity scores can
be calculated in O(m) (linear) space if only the optimal alignment needs to be found,
but naive algorithms to produce the alignment require O(mn) space.
2.2.3 BERRY RAVINDRAN
Berry and Ravindran designed an algorithm which performs the shifts by
considering the bad-character shift for the two consecutive text characters immediately to
the right of the window.
2.2.4 REVERSE COLUSSI
The character comparisons are done using a specific order given by a table h.
For each integer i such that 0 ≤ i ≤ m we define two disjoint sets:
Pos(i) = {k : 0 ≤ k ≤ i and x[i] = x[i-k]}
Neg(i) = {k : 0 ≤ k ≤ i and x[i] ≠ x[i-k]}
2.2.5 QUICK SEARCH ALGORITHM
The Quick Search algorithm uses only the bad-character shift table (see the
Boyer-Moore algorithm). After an attempt where the window is positioned on the text factor
y[j .. j+m-1], the length of the shift is at least equal to one. So, the character y[j+m] is
necessarily involved in the next attempt, and thus can be used for the bad-character shift of
the current attempt.
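A possible Java rendering of this shift rule is sketched below. It is an illustration only: the class name QuickSearchSketch is hypothetical, and y and j from the description correspond to text and s in the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QuickSearchSketch {
    // Returns every 0-based shift at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        // Shift table indexed by the character just to the right of the window:
        // m + 1 if the character is not in the pattern, otherwise the distance
        // from its last occurrence in the pattern to one past the pattern end.
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m + 1);
        for (int i = 0; i < m; i++) {
            shift[pattern.charAt(i)] = m - i;
        }

        int s = 0;
        while (s <= n - m) {
            if (text.regionMatches(s, pattern, 0, m)) shifts.add(s);
            if (s + m >= n) break; // no character to the right of the window
            s += shift[text.charAt(s + m)];
        }
        return shifts;
    }
}
```

Because the shift is always at least one, the character text[s + m] is guaranteed to be covered by the next window, which is what makes it safe to use for the table lookup.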
2.2.6 OPTIMAL MISMATCH
The preprocessing phase of the Optimal Mismatch algorithm consists of sorting the pattern
characters in decreasing order of their frequencies and then building the Quick Search
bad-character shift function (see the Quick Search algorithm) and a good-suffix shift
function adapted to the scanning order of the pattern characters. It can be done in O(m² + σ)
time and O(m + σ) space, where σ is the size of the alphabet.
2.2.7 RAITA ALGORITHM
Raita algorithm searches for a pattern "P" in a given text "T" by comparing each character
of pattern in the given text. Searching will be done as follows. Window for a text "T" is
defined as the length of "P".
1. First, last character of the pattern is compared with the rightmost character of the
window.
2. If there is a match, first character of the pattern is compared with the leftmost
character of the window.
3. If they match again, it compares the middle character of the pattern with middle
character of the window.
2.2.8 REVERSE FACTOR ALGORITHM
The Reverse Factor algorithm parses the characters of the window from right to left with
the automaton S(x^R), the suffix automaton of the reversed pattern x^R, starting with state q0.
It continues until there is no transition defined
for the current character of the window from the current state of the automaton. At this
moment it is easy to know the length of the longest prefix of the pattern which has
been matched: it corresponds to the length of the path taken in S(x^R) from the start state q0
to the last final state encountered. Knowing the length of this longest prefix, it is trivial to
compute the right shift to perform.
2.3 SOFTWARE DESCRIPTION
2.3.1 JAVA
Java is a set of computer software and specifications developed by Sun Microsystems,
which was later acquired by the Oracle Corporation, that provides a system for developing
application software and deploying it in a cross-platform computing environment. Java is
used in a wide variety of computing platforms from embedded devices and mobile phones
to enterprise servers and supercomputers. While they are less common than standalone Java
applications, Java applets run in secure, sandboxed environments to provide many features
of native applications and can be embedded in HTML pages.
Writing in the Java programming language is the primary way to produce code that
will be deployed as byte code in a Java Virtual Machine (JVM); byte code compilers are
also available for other languages, including Ada, JavaScript, Python, and Ruby. In
addition, several languages have been designed to run natively on the JVM, including
Scala, Clojure and Groovy. Java syntax borrows heavily from C and C++, but object-oriented
features are modeled after Smalltalk and Objective-C. Java eschews certain
low-level constructs such as pointers and has a very simple memory model where every
object is allocated on the heap and all variables of object types are references.
Memory management is handled through integrated automatic garbage
collection performed by the JVM.
On November 13, 2006, Sun Microsystems made the bulk of its implementation
of Java available under the GNU General Public License (GPL).
PLATFORM
The Java platform is a suite of programs that facilitate developing and running programs
written in the Java programming language. A Java platform will include an execution
engine (called a virtual machine), a compiler and a set of libraries; there may also be
additional servers and alternative libraries that depend on the requirements. Java is not
specific to any processor or operating system as Java platforms have been implemented for
a wide variety of hardware and operating systems with a view to enable Java programs to
run identically on all of them.
2.3.2 ECLIPSE PLATFORM ARCHITECTURE
Eclipse uses plug-ins to provide all the functionality within and on top of the runtime
system. Its runtime system is based on Equinox, an implementation of the OSGi core
framework specification.
In addition to allowing the Eclipse Platform to be extended using other programming
languages, such as C and Python, the plug-in framework allows the Eclipse Platform to
work with typesetting languages like LaTeX and networking applications such as telnet
and database management systems. The plug-in architecture supports writing any desired
extension to the environment, such as for configuration management. Java and CVS
support is provided in the Eclipse SDK, with support for other version control systems
provided by third-party plug-ins.
With the exception of a small run-time kernel, everything in Eclipse is a plug-in.
This means that every plug-in developed integrates with Eclipse in exactly the same way
as other plug-ins; in this respect, all features are "created equal". Eclipse
provides plug-ins for a wide variety of features, some of which are from third parties
using both free and commercial models. Examples include plug-ins for UML, for
sequence and other UML diagrams, and for DB Explorer, among many others.
The Eclipse SDK includes the Eclipse Java development tools (JDT), offering an
IDE with a built-in incremental Java compiler and a full model of the Java source files. This
allows for advanced refactoring techniques and code analysis. The IDE also makes use of
a workspace, in this case a set of metadata over a flat filespace allowing external file
modifications as long as the corresponding workspace "resource" is refreshed afterwards.
Eclipse implements the graphical control elements of the Java toolkit called SWT, whereas
most Java applications use the Java standard Abstract Window Toolkit (AWT) or Swing.
Eclipse's user interface also uses an intermediate graphical user interface layer called JFace,
which simplifies the construction of applications based on SWT. Eclipse was made to run
on Wayland during a GSoC-Project in 2014.
Fig 2.2 Java Perspective.
CHAPTER 3
REQUIREMENT ANALYSIS
3.1 FUNCTIONAL REQUIREMENTS
The following are the parameters to be analyzed:
• Execution time
• Complexity
• Total number of comparisons
• Line of code
• Pre-processing time
3.1.1 EXECUTION TIME
The execution time or CPU time of a given task is defined as the time spent by the system
executing that task, including the time spent executing run-time or system services on its
behalf. The mechanism used to measure execution time is implementation defined. It is
implementation defined which task, if any, is charged the execution time that is consumed
by interrupt handlers and run-time services on behalf of the system.
The type CPU_Time represents the execution time of a task. The set of values of this type
corresponds one-to-one with an implementation-defined range of mathematical
integers.
CPU_Time_Start and CPU_Time_End are the smallest and largest values of the
CPU_Time type, respectively.
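As a rough illustration of how such a measurement might be taken in Java, the sketch below times a single task with System.nanoTime. Note the assumptions: this measures elapsed wall-clock time rather than CPU time, and the names TimingSketch and timeNanos are hypothetical and not taken from the project.

```java
final class TimingSketch {
    // Runs the task once and returns the elapsed wall-clock time in nanoseconds.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String text = "abracadabra";
        long elapsed = timeNanos(() -> text.indexOf("cad"));
        System.out.println("Elapsed time (ns): " + elapsed);
    }
}
```

A single run is noisy on the JVM because of just-in-time compilation and garbage collection, so in practice a measurement like this would usually be repeated and averaged.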
Algorithm                Execution time
NAÏVE                    O((n-m+1)m)
KNUTH MORRIS PRATT       O(m+n)
FINITE AUTOMATA          O(n) matching, after O(m * NO_OF_CHARS) to build the automaton
RABIN KARP               O(m+n) expected; O((n-m+1)m) in the worst case
Table 3.1 EXECUTION TIME OF ALGORITHMS
3.1.2 COMPLEXITY:
Algorithmic complexity is concerned with how fast or slow a particular algorithm
performs. We define complexity as a numerical function T(n): time versus the input size
n. We want to define the time taken by an algorithm without depending on the implementation
details. But T(n) does depend on the implementation! A given algorithm will
take different amounts of time on the same inputs depending on such factors as processor
speed, instruction set, disk speed and compiler. The way around this is to estimate
the efficiency of each algorithm asymptotically. We will measure time T(n) as the number of
elementary "steps" (defined in any way), provided each such step takes constant time.
Let us consider a classical example: the addition of two integers. We will add two integers
digit by digit (or bit by bit), and this will define a "step" in our computational model.
Therefore, we say that the addition of two n-bit integers takes n steps. Consequently, the total
computational time is T(n) = c * n, where c is the time taken by the addition of two bits. On
different computers, the addition of two bits might take different times, say c1 and c2, so the
addition of two n-bit integers takes T(n) = c1 * n and T(n) = c2 * n respectively. This shows
that different machines result in different slopes, but the time T(n) grows linearly as the input size
increases.
3.1.3 TOTAL NUMBER OF COMPARISONS:
This is the count of the character comparisons made between the text and the pattern. In other
words, the pattern is compared with each and every string or line of the text. This is done both
logically and physically.
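One way such a count could be obtained is to increment a counter at every character comparison. The sketch below does this for the naive method; the class and method names are hypothetical, and the project's own counting code may differ.

```java
final class ComparisonCountSketch {
    // Counts the character comparisons the naive method performs
    // when searching for pattern in text.
    static long countNaiveComparisons(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long comparisons = 0;
        for (int s = 0; s + m <= n; s++) {
            for (int j = 0; j < m; j++) {
                comparisons++; // one text/pattern character comparison
                if (text.charAt(s + j) != pattern.charAt(j)) break;
            }
        }
        return comparisons;
    }
}
```

For the text "aaaaaa" and the pattern "aab", every one of the n-m+1 = 4 shifts needs all m = 3 comparisons, giving 12 comparisons in total, which is the (n-m+1)m worst case listed for the naive algorithm in Table 3.1.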
3.1.4 LINE OF CODE:
Line of code (LOC) is a software metric used to measure the size of the computer program
by counting the number of lines in the text of the program’s source code. It is typically used
to predict the amount of effort that will be required to develop a program, as well as to
estimate programming productivity or maintainability once the software is produced.
There are two major types of LOC:
• Physical LOC
• Logical LOC
Specific definitions of these two measures vary, but the most common definition of physical
LOC is a count of lines in the text of the program’s source code excluding the comment
lines.
Logical LOC attempts to measure the number of executable "statements", but their
specific definitions are tied to specific computer languages. It is much easier to create tools
that measure physical LOC and physical LOC definitions are easier to explain.
However, physical LOC measures are sensitive to logically irrelevant formatting and style
conventions, while logical LOC is less sensitive to formatting and style conventions.
However, LOC measures are often stated without giving their definition, and logical LOC
can often be significantly different from physical LOC.
3.1.5 PRE-PROCESSING TIME:
The following table shows the pre-processing time of different algorithms:
Algorithm                Pre-processing time
NAÏVE                    0
KNUTH MORRIS PRATT       O(m)
FINITE AUTOMATA          O(m * NO_OF_CHARS)
RABIN KARP               O(m)
Table 3.2 Pre-processing time
3.2 HARDWARE REQUIREMENTS
Processor : Any Processor above 500 MHz
RAM : 4 GB
Hard Disk : 1 TB
Input device : Standard Keyboard and Mouse
Output device : High Resolution Monitor
3.3 SOFTWARE REQUIREMENTS
• Operating system : Windows 10
• Front End : Eclipse 4.5 Mars in the Java EE perspective
• Platform : Java SE, Standard Widget Toolkit
• Type : Integrated Development Environment(IDE)
• Server : Internet Information Services
CHAPTER 4
DESIGN
4.1 DESIGN GOALS
4.1.1 NEED OF PATTERN MATCHING
Pattern matching is the process of checking a given sequence of characters for the presence
of the constituents of some pattern. In contrast to pattern recognition, the match usually has
to be exact. The patterns generally have the form of sequences. Uses of pattern matching include
outputting the locations of a pattern within a string sequence, outputting some component of
the matched pattern, and substituting the matching pattern with some other string sequence
(i.e., search and replace). Pattern matching is used in many applications. The following
figure shows the different applications.
Fig 4.1 Applications of pattern matching
4.1.2 NOTATION USED BY THE ALGORITHM
NOTATION    USED FOR
N           Length of the text
M           Length of the pattern (string)
C           Size of the alphabet
Table 4.1 Algorithm notation.
4.2 ALGORITHM TECHNIQUES
Every algorithm uses some special technique to find pattern matches. The following table
shows the techniques used by the different algorithms.
ALGORITHM                                        TECHNIQUE
NAÏVE STRING MATCHING ALGORITHM                  Each character of the pattern is compared to a
                                                 substring of the text which is the length of the
                                                 pattern, until there is a mismatch or a match.
KNUTH-MORRIS PRATT STRING MATCHING ALGORITHM     Two indices l and r into text string t
FINITE AUTOMATA STRING MATCHING ALGORITHM        Subset construction
RABIN-KARP STRING MATCHING ALGORITHM             Hashing
Table 4.2 Algorithm techniques.
4.3 ALGORITHMS
4.3.1 NAÏVE STRING MATCHING ALGORITHM
NAIVE_STRING_MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m do
        if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift" s
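The pseudocode above translates fairly directly into Java. The following is a minimal sketch, not the project's source: it uses 0-based string indices, so a reported shift s is the index in the text where the match starts, and the class name NaiveMatcher is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveMatcher {
    // Returns every shift s (0-based) such that pattern occurs in text at s.
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        for (int s = 0; s <= n - m; s++) {
            // Compare the pattern with text[s..s+m-1] from left to right.
            int j = 0;
            while (j < m && text.charAt(s + j) == pattern.charAt(j)) j++;
            if (j == m) shifts.add(s);
        }
        return shifts;
    }
}
```

For example, NaiveMatcher.search("abracadabra", "abra") reports the shifts 0 and 7.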
4.3.2 KNUTH-MORRIS PRATT STRING MATCHING
ALGORITHM
KMP_MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    ∏ ← KMP-PREFIX(P)
    i ← 0
    for j ← 1 to n do
        while i > 0 and P[i+1] ≠ T[j]
            do i ← ∏[i]
        if P[i+1] = T[j]
            then i ← i+1
        if i = m
            then print "Pattern occurs with shift" j-m
                 i ← ∏[i]
KMP-PREFIX(P)
    m ← length[P]
    ∏[1] ← 0
    i ← 0
    for j ← 2 to m do
        while i > 0 and P[i+1] ≠ P[j]
            do i ← ∏[i]
        if P[i+1] = P[j]
            then i ← i+1
        ∏[j] ← i
    return ∏
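A Java sketch of the same two procedures is given below. It is illustrative only: indices are 0-based, so pi[j] holds the length of the longest proper prefix of the pattern that is also a suffix of pattern[0..j], and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpMatcher {
    // Prefix function: pi[j] is the length of the longest proper prefix of
    // p[0..j] that is also a suffix of p[0..j].
    static int[] prefixFunction(String p) {
        int m = p.length();
        int[] pi = new int[m];
        int k = 0;
        for (int j = 1; j < m; j++) {
            while (k > 0 && p.charAt(k) != p.charAt(j)) k = pi[k - 1];
            if (p.charAt(k) == p.charAt(j)) k++;
            pi[j] = k;
        }
        return pi;
    }

    // Returns every 0-based shift at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0) return shifts;
        int[] pi = prefixFunction(pattern);
        int k = 0; // number of pattern characters matched so far
        for (int j = 0; j < n; j++) {
            while (k > 0 && pattern.charAt(k) != text.charAt(j)) k = pi[k - 1];
            if (pattern.charAt(k) == text.charAt(j)) k++;
            if (k == m) {
                shifts.add(j - m + 1);
                k = pi[k - 1]; // continue, allowing overlapping occurrences
            }
        }
        return shifts;
    }
}
```

The text index j never moves backwards, which is why the matching phase runs in O(n) after the O(m) preprocessing.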
4.3.3 FINITE AUTOMATA STRING MATCHING ALGORITHM
DFAPatternSearch(pattern, text)
    nextState = computeNextStateFunction(pattern)
    currentState = 0
    for textCursor = 0 to n-1
        character = text[textCursor]
        currentState = nextState[currentState][character]
        if currentState = m
            return (textCursor - m + 1)
        endif
    endfor
    return -1
computeNextStateFunction(pattern)
    for state = 0 to m-1
        for character = smallestChar to largestChar
            prefixPlusChar = concatenate(first state characters of pattern, character)
            k = length of longest suffix of prefixPlusChar that is a prefix of pattern
            nextState[state][character] = k
        endfor
    endfor
    return nextState
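The pseudocode above can be sketched in Java as shown below. Several points are assumptions made for the example and not taken from the project: the class name AutomatonMatcher, an alphabet of 256 character codes (the pattern is assumed to contain only such characters), and reporting all occurrences rather than returning at the first one, which is why the table also has a row for state m.

```java
import java.util.ArrayList;
import java.util.List;

final class AutomatonMatcher {
    // Builds the transition table in the direct way: for each state and
    // character, find the longest pattern prefix that is a suffix of
    // (matched prefix + character). This takes O(m^3 * alphabetSize) time.
    static int[][] buildTransitions(String pattern, int alphabetSize) {
        int m = pattern.length();
        int[][] next = new int[m + 1][alphabetSize];
        for (int state = 0; state <= m; state++) {
            for (int c = 0; c < alphabetSize; c++) {
                String candidate = pattern.substring(0, state) + (char) c;
                int k = Math.min(m, state + 1);
                while (k > 0 && !candidate.endsWith(pattern.substring(0, k))) k--;
                next[state][c] = k;
            }
        }
        return next;
    }

    // Returns every 0-based shift at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        final int alphabetSize = 256; // assumption: pattern uses codes 0..255
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0) return shifts;
        int[][] next = buildTransitions(pattern, alphabetSize);
        int state = 0;
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            // A text character outside the alphabet cannot extend any match.
            state = (c < alphabetSize) ? next[state][c] : 0;
            if (state == m) shifts.add(i - m + 1);
        }
        return shifts;
    }
}
```

Once the table is built, each text character costs exactly one table lookup, so the matching phase is linear in the length of the text.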
4.3.4 RABIN KARP STRING MATCHING ALGORITHM
RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^(m-1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m
        do p ← (d*p + P[i]) mod q
           t_0 ← (d*t_0 + T[i]) mod q
    for s ← 0 to n-m
        do if p = t_s
               then if P[1..m] = T[s+1..s+m]
                        then print "Pattern occurs with shift" s
           if s < n-m
               then t_(s+1) ← (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
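A Java sketch of this procedure follows. It is an illustration under stated assumptions: the class name RabinKarpMatcher is hypothetical, d is the radix and q the modulus as in the pseudocode, indices are 0-based, and long arithmetic is used so that the intermediate products do not overflow for int-sized d and q.

```java
import java.util.ArrayList;
import java.util.List;

final class RabinKarpMatcher {
    // Returns every 0-based shift at which pattern occurs in text,
    // using radix d and modulus q for the fingerprints.
    static List<Integer> search(String text, String pattern, int d, int q) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return shifts;

        long h = 1; // h = d^(m-1) mod q
        for (int i = 1; i < m; i++) h = (h * d) % q;

        long p = 0, t = 0; // fingerprints of the pattern and of the window
        for (int i = 0; i < m; i++) {
            p = (d * p + pattern.charAt(i)) % q;
            t = (d * t + text.charAt(i)) % q;
        }

        for (int s = 0; s <= n - m; s++) {
            // Equal fingerprints may be a spurious hit, so verify explicitly.
            if (p == t && text.regionMatches(s, pattern, 0, m)) shifts.add(s);
            if (s < n - m) {
                // Rolling update: drop text[s], append text[s + m].
                t = (d * (t - text.charAt(s) * h) + text.charAt(s + m)) % q;
                if (t < 0) t += q; // Java's % can return a negative value
            }
        }
        return shifts;
    }
}
```

A call such as RabinKarpMatcher.search("abracadabra", "abra", 256, 101) reports the shifts 0 and 7. A very small modulus produces many spurious hits, but the explicit verification keeps the result correct.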
4.4 APACHE POI
Apache POI helps in creating a spreadsheet and manipulating it using Java. A spreadsheet is a
page in an Excel file; it contains rows and columns with specific names.
Operations performed are:
1) Create a blank workbook
2) Create a blank spreadsheet
3) Create a row
4) Insert cell values
5) Close
4.5 GRAPH SNIPPET
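import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// extime (the measured run time) and filename are defined elsewhere in the program.
HSSFWorkbook workbook = new HSSFWorkbook();
// A new workbook has no sheets yet, so the sheet is created rather than fetched.
HSSFSheet sheet = workbook.createSheet("FirstSheet");
HSSFRow rowhead = sheet.createRow(0);
rowhead.createCell(0).setCellValue("run tym");
HSSFRow row = sheet.createRow(1);
row.createCell(0).setCellValue(extime);
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();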
CHAPTER 5
IMPLEMENTATION
5.1 DATA SET
A data set is a collection of related, discrete data items that may be accessed
individually, in combination, or managed as a whole entity. A data set can be a database
or a plain text file. In this project we use a text file as the data set.
Functions Used To Import Data from a Text File:
The BufferedReader class reads text from a character-input stream, buffering characters so as
to provide efficient reading of characters, arrays and lines. The buffer size may be specified,
or the default size may be used; the default is large enough for most purposes. It is created as follows:
BufferedReader bufferedReader = new BufferedReader(new FileReader("my file"));
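As an illustration of how the data set might be loaded with BufferedReader, the sketch below reads a text file line by line and joins the lines into a single string to search in. The class name DataSetReader, the method readAll and the choice of a single space as line separator are assumptions made for this example, not details of the project code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class DataSetReader {
    // Reads the whole file into one string, joining lines with single spaces.
    static String readAll(String path) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) { // null marks end of file
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(line);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the path of the data set as the first argument.
        if (args.length > 0) {
            System.out.println(readAll(args[0]).length() + " characters read");
        }
    }
}
```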
Scanner is a class that allows the program to read values of various types. It is a simple
text scanner that can parse primitive types and strings using regular expressions. The scanner
looks for tokens in the input; a token is a series of characters that ends with what Java calls
whitespace. Scanner breaks its input into tokens using a delimiter pattern, which by default
matches whitespace. The Scanner class belongs to the java.util package.
Two constructors are particularly useful: one takes an InputStream object as
a parameter and the other takes a FileReader object as a parameter.
Scanner in = new Scanner(System.in); // System.in is the standard input stream
Scanner inFile = new Scanner(new FileReader("myFile"));
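To show what splitting input into whitespace-delimited tokens looks like in practice, here is a small sketch that uses the Scanner constructor taking a String, so that it runs without a file. The class TokenSketch and its method tokens are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenSketch {
    // Splits the input into tokens using Scanner's default whitespace delimiter.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner in = new Scanner(input)) {
            while (in.hasNext()) {
                out.add(in.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("When I was seventeen")); // prints [When, I, was, seventeen]
    }
}
```

Runs of spaces and line breaks all count as one delimiter, and punctuation stays attached to its word.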
Fig 5.1 Block diagram to show reader and writer function
5.1.1 INPUT TEXT FILE
A Walk to Remember
NICHOLAS SPARKS
Prologue When I was seventeen, my life changed forever. I know that there are people
who wonder about me when I say this. They look at me strangely as if trying to
fathom what could have happened back then, though I seldom bother to explain.
Because I've lived here for most of my life, I don't feel that I have to unless it's on my
terms, and that would take more time than most people are willing to give me. My
story can't be summed up in two or three sentences; it can't be packaged into
something neat and simple that people would immediately understand. Despite the
passage of forty years, the people still living here who knew me that year accept my
lack of explanation without question. My story in some ways is their story because it
was something that all of us lived through. It was I, however, who was closest to it.
I'm fifty-seven years old, but even now I can remember everything from that year,
down to the smallest details. I relive that year often in my mind, bringing it back to
life, and I realize that when I do, I always feel a strange combination of sadness and
joy. There are moments when I wish I could roll back the clock and take all the
sadness away, but I have the feeling that if I did, the joy would be gone as well. So I
take the memories as they come, accepting them all, letting them guide me whenever I
can. This happens more often than I let on. It is April 12, in the last year before the
millennium, and as I leave my house, I glance around. The sky is overcast and gray,
but as I move down the street, I notice that the dogwoods and azaleas are blooming.
Old Hegbert, he'd stop dead in his tracks and his ears would perk up-I swear to God,
they actually moved-and he'd turn this bright shade of red, like he'd just drunk
gasoline, and the big green veins in his neck would start sticking out all over, like
those maps of the Amazon River that you see in National Geographic. He'd peer from
side to side, his eyes narrowing into slits as he searched .
SAMPLE INPUT TEXT FILE
5.2 GRAPHICAL USER INTERFACE:
We have used a JComboBox for user interaction. The JComboBox class is a component
that combines a button or editable field with a drop-down list.
The declaration of the javax.swing.JComboBox class is as follows:
public class JComboBox extends JComponent
implements ItemSelectable, ListDataListener,
ActionListener, Accessible
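A minimal sketch of how a JComboBox could offer the four algorithms for selection. The item labels, the window title and the names AlgorithmChooser and algorithmModel are assumptions for illustration; the project's actual GUI code is not shown in this report.

```java
import javax.swing.DefaultComboBoxModel;
import javax.swing.JComboBox;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

class AlgorithmChooser {
    // Model holding the choices; the labels are illustrative.
    static DefaultComboBoxModel<String> algorithmModel() {
        return new DefaultComboBoxModel<>(new String[] {
            "Naive", "Knuth-Morris-Pratt", "Finite Automata", "Rabin-Karp"});
    }

    public static void main(String[] args) {
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            JComboBox<String> box = new JComboBox<>(algorithmModel());
            // Fires whenever the user picks an entry from the drop-down list.
            box.addActionListener(e ->
                System.out.println("Selected: " + box.getSelectedItem()));
            JFrame frame = new JFrame("String matching");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(box);
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```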
Fig 5.2 Block diagram to show a view of JComboBox
Analysis:
We have graphically analyzed the results of the project using an Excel sheet.
CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets
its requirements and user expectations and does not fail in an unacceptable manner. There
are various types of tests, and each test type addresses a specific testing requirement.
TYPES OF TESTS
6.1 UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software
units of the application, and it is done after the completion of an individual unit and before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at component level and test a specific business
process, application, and/or system configuration. Unit tests ensure that each unique path
of a business process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
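As a small illustration of a unit test with clearly defined inputs and expected results, the sketch below checks a simple matcher against Java's built-in String.indexOf, which serves as the expected result. The matcher firstMatch, the test cases and the class name MatcherUnitTest are hypothetical; no test framework is assumed.

```java
class MatcherUnitTest {
    // Hypothetical unit under test: index of the first occurrence, or -1.
    static int firstMatch(String text, String pattern) {
        for (int s = 0; s + pattern.length() <= text.length(); s++) {
            if (text.startsWith(pattern, s)) {
                return s;
            }
        }
        return -1;
    }

    // Runs every case and returns the number of failures; 0 means all passed.
    static int runTests() {
        String[][] cases = {
            {"abracadabra", "cad"}, // match in the middle
            {"abracadabra", "xyz"}, // no match
            {"aaa", "aaaa"},        // pattern longer than text
            {"", "a"},              // empty text
            {"abc", "abc"}          // pattern equals text
        };
        int failures = 0;
        for (String[] c : cases) {
            int expected = c[0].indexOf(c[1]); // reference result
            int actual = firstMatch(c[0], c[1]);
            if (expected != actual) {
                failures++;
                System.out.println("FAIL: text=" + c[0] + " pattern=" + c[1]);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(runTests() == 0 ? "All tests passed" : "Some tests failed");
    }
}
```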
6.2 INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine whether they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that, although the components
were individually satisfactory, as shown by successful unit testing, the combination of
components is also correct and consistent. Integration testing is specifically aimed at exposing
the problems that arise from the combination of components.
6.3 VALIDATION TESTING
An engineering validation test (EVT) is performed on first engineering prototypes
to ensure that the basic unit performs to design goals and specifications. Identifying
design problems and solving them as early in the design cycle as possible is
the key to keeping projects on time and within budget. Too often, product design and
performance problems are not detected until late in the product development cycle, when
the product is ready to be shipped. The old adage holds true: it costs a penny to make a
change in engineering, a dime in production and a dollar after a product is in the
field.
Verification is a quality control process that is used to evaluate whether or not a
product, service, or system complies with regulations, specifications, or conditions imposed
at the start of a development phase. Verification can take place in development, scale-up, or
production. This is often an internal process.
Validation is a Quality assurance process of establishing evidence that provides a
high degree of assurance that a product, service, or system accomplishes its intended
requirements. This often involves acceptance of fitness for purpose with end users and other
product stakeholders.
The testing process overview is as follows:
Fig 6.1 The testing process
6.4 SYSTEM TESTING
System testing of software or hardware is testing conducted on a complete,
integrated system to evaluate the system's compliance with its specified requirements.
System testing falls within the scope of black box testing, and as such, should require no
knowledge of the inner design of the code or logic.
As a rule, system testing takes, as its input, all of the "integrated" software
components that have successfully passed integration testing and also the software system
itself integrated with any applicable hardware system(s).
System testing is a more limited type of testing; it seeks to detect defects both
within the "inter-assemblages" and also within the system as a whole.
System testing is performed on the entire system in the context of a Functional
Requirement Specification(s) (FRS) and/or a System Requirement Specification (SRS).
System testing tests not only the design, but also the behavior and even the believed
expectations of the customer. It is also intended to test up to and beyond the bounds defined
in the software/hardware requirements specification(s).
6.5 TESTING OF INITIALIZATION AND GUI COMPONENTS
GUIs are tested manually, often by the developers themselves. This is very unreliable and
expensive. For new GUIs or those being significantly changed, quality is low, and failures
at integration time or during user acceptance tests are common.
Screen Scraper/replay based GUI test techniques are adopted. In a few months, someone
notices that these don't work, though they are beloved by managers looking for cheap
solutions. The problem is that every time you change the screen layout all existing tests
become useless, which means you have no regression tests.
Serial Number of Test Case: TC 01
Module Under Test: DATASET
Description: When the program is executed, any word or phrase of the user's choice is
searched for in a text file (data set), along with analysis of time complexity,
preprocessing time, lines of code, number of comparisons and execution time.
Output: If the pattern is present in the data set, "match found" is reported along with the
analyzed parameters; otherwise "match not found" is reported.
Remarks: Test Successful.
Table 6.1: Test case when input is taken from a data set (text file)
Serial Number of Test Case: TC 01
Module Under Test: USER INPUT
Description: When the program is executed, the user can give the input as text, and the
given pattern is searched for in the given text, along with analysis of time complexity,
preprocessing time, lines of code, number of comparisons and execution time.
Output: If the pattern is present in the text, "match found" is reported along with the
analyzed parameters; otherwise "match not found" is reported.
Remarks: Test Successful.
Table 6.2: Test case when input is given by the user
CHAPTER 7
SNAPSHOTS
7.1 OUTPUT OF NAÏVE ALGORITHM
Fig 7.1 Output of Naïve algorithm.
7.2 OUTPUT OF KNUTH-MORRIS-PRATT ALGORITHM
Fig 7.2 Output of Knuth Morris Pratt algorithm.
7.3 OUTPUT OF FINITE AUTOMATA ALGORITHM
Fig 7.3 Output of Finite automata algorithm.
7.4 OUTPUT OF RABIN-KARP ALGORITHM
Fig 7.4 Output of Rabin Karp algorithm.
7.5 COMPARISON OF ALGORITHMS OUTPUT
Fig 7.5.1 Output of algorithm comparison parameters.
Fig 7.5.1 Output of algorithm comparison parameters.
CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT
8.1 CONCLUSION
String matching is the problem of finding all occurrences of a character pattern in a text.
This project provides an overview of different string matching algorithms and a comparative
study of these algorithms. In this project, we have evaluated four algorithms: the Naïve
string matching algorithm, the Knuth-Morris-Pratt string matching algorithm, the Rabin-Karp
string matching algorithm and the finite automata string matching algorithm.
String matching algorithms play a vital role in computational biology. The functional
and structural relationship of a biological sequence is determined by similarities within that
sequence, so the researcher needs to be aware of similarities in biological
sequences. The search for similarity among biological sequences is an important research area
that can bring insight into the evolutionary and genetic relationships among genes.
In this project, we have studied different kinds of string matching algorithms and observed
their time complexity, number of comparisons, lines of code, execution time and
preprocessing time.
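One way the number of comparisons and the execution time could be observed is sketched below for the Naïve algorithm. This is an illustrative sketch, not the project's measurement code: the counter field, the use of System.nanoTime and the class name NaiveWithMetrics are assumptions.

```java
class NaiveWithMetrics {
    static long comparisons; // character comparisons made by the last search

    // Naive matcher: returns the first 0-based shift, or -1 if there is none.
    static int search(String t, String p) {
        comparisons = 0;
        int n = t.length();
        int m = p.length();
        for (int s = 0; s + m <= n; s++) {
            int i = 0;
            while (i < m) {
                comparisons++; // count every character comparison
                if (t.charAt(s + i) != p.charAt(i)) {
                    break;
                }
                i++;
            }
            if (i == m) {
                return s;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int shift = search("aaaaaaab", "aab");
        long elapsedNs = System.nanoTime() - start;
        // The elapsed time varies from run to run; the comparison count does not.
        System.out.println("shift=" + shift + ", comparisons=" + comparisons
                + ", time(ns)=" + elapsedNs);
    }
}
```

A single timing is noisy on the JVM, so averaging over repeated runs gives a steadier execution-time figure than one measurement.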
From this study, we conclude that the KMP algorithm is relatively easy to implement
because it never needs to move backwards in the input sequence. The Rabin-Karp algorithm is used
to detect plagiarism. The Naïve algorithm does not require preprocessing of the text or the
pattern, but it is very slow and rarely produces an efficient result.
The string-matching automaton is very efficient: it examines each character in the text
exactly once and reports all the valid shifts. The finite automata string matching algorithm is
also faster than the other algorithms.
Innovation and creativity in string matching can play an immense role for getting time
efficient performance in various domains of computer science.
8.2 FUTURE ENHANCEMENT
1. Space complexity can be determined.
2. The algorithms can be applied to inputs larger than the current maximum size.
3. The program can be connected to the web, using the Internet as a data set.
4. The graphical user interface can be made more user friendly.