PROJECT REPORT
CONTENT BASED SEARCH BY RETRIEVING THE FILES
A thesis submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY (Computer Science)
NIMRA COLLEGE OF ENGINEERING AND TECHNOLOGY, VIJAYAWADA
Zulfikar Ali.Md (06231A05C0)
Sudeesha.M (06231A05A4)
Ritesh Abhishekh (06231A0570)
Sajida Bhanu (06231A0572)
Pramod.G (06231A0562)
Under the esteemed guidance of
Miss. G. ANITHA, B.Tech (CSE), Lecturer
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
NIMRA COLLEGE OF ENGINEERING AND TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University)
AN ISO 9001-2000 CERTIFIED INSTITUTION
JUPUDI, VIJAYAWADA, A.P.
MAY, 2009
ABSTRACT
Content Based File Search is a Java application to find files that
contain (or don’t contain) a given string. The string may be in plain
text or it may be a Java regular expression. Such a trivial search
should be part of the operating system, and in fact, once was. As
bigger and more impressive features were added to Windows, it lost
the ability to search files for arbitrary bytes of text. Windows
98/ME/2000 could find words buried in files with unknown formats;
Windows XP and Vista search only their supported file types. Through
the creation of content with applications, the downloading of content
from the Internet, or the receipt of content via email, a file system
can become quite full of important content located throughout the
system. Whether these files are carefully filed away in deeply
nested hierarchical folders, or haphazardly filed away in a nearly flat
system, at some point that data probably needs to be accessed again.
It is at this point the problem of desktop search becomes apparent. In
a system consisting of gigabytes of data spread across thousands or even
millions of files, it is important to have an efficient desktop search
engine. The speed of this program depends upon the speed of the
computer’s hardware and the complexity of the search string. When
searching for plain ASCII text or Unicode characters from 0x20 to
0x7E, the “(raw data bytes)” encoding is about 40% faster than the
local system’s “(default encoding)”.
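The search described in the abstract can be sketched in Java as follows. This is a minimal illustration, not the report's actual code: the class and method names are invented here, and the file is read as raw bytes via ISO-8859-1, which maps each byte to one character.

```java
import java.io.*;
import java.util.*;

// Minimal sketch of a content-based file search: walk a directory tree
// and report files whose raw bytes contain the query string.
public class ContentSearch {

    // Reads the whole file and checks for the query substring.
    // ISO-8859-1 decodes each byte to exactly one char, so this is the
    // "(raw data bytes)" style of matching the report describes.
    static boolean fileContains(File file, String query) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InputStream in = new FileInputStream(file);
        try {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
        } finally {
            in.close();
        }
        String content = new String(buf.toByteArray(), "ISO-8859-1");
        return content.indexOf(query) >= 0;
    }

    // Recursively collect matching files under root.
    static void search(File root, String query, List hits) throws IOException {
        File[] entries = root.listFiles();
        if (entries == null) return;          // not a directory, or unreadable
        for (int i = 0; i < entries.length; i++) {
            File f = entries[i];
            if (f.isDirectory()) search(f, query, hits);
            else if (fileContains(f, query)) hits.add(f);
        }
    }

    public static void main(String[] args) throws IOException {
        List hits = new ArrayList();
        search(new File(args[0]), args[1], hits);
        for (int i = 0; i < hits.size(); i++) System.out.println(hits.get(i));
    }
}
```

Invoked as `java ContentSearch <folder> <string>`, this prints every matching file path; the real application adds regular-expression matching and a graphical interface on top of the same traversal.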
Introduction
The capacity of our hard-disk drives has increased
tremendously over the past decade, and so has the number of files we
usually store on our computer. It is no wonder that sometimes we
cannot find a document any more, even when we know we saved it
somewhere. The recent arrival of desktop search applications, which
index all data on a PC, promises to increase search efficiency on the
desktop. Still, these search applications are weaker than their web
counterparts. Unfortunately, they also fall short of utilizing desktop
specific characteristics, especially context information. For example,
one file might contain a question describing the object one is looking
for, and another file in the same thread might include the answer to
that question in the form of an attached document. The search
functionality in earlier versions of Windows searches all files for the
specified string and may return a large number of irrelevant files such
as program and configuration files. In current versions, the search
functionality finds the same set of files whether the Content Index
service is turned on or off; in previous versions of Windows, the
behavior differed depending on whether the service was enabled.
Content-based File retrieval was initially proposed to overcome
the difficulties encountered in keyword-based File search in 1990s.
Since then, it has been an active research topic, and a lot of
algorithms have been published in the literature. In keyword-based
search, files have to be manually annotated with keywords. As keyword
annotation is a tedious process, it is impractical at scale, and the
annotations may be inconsistent. In content-based retrieval, by
contrast, feature extraction can be performed automatically, so the
human labeling process is avoided.
Context and Content
This metric brings about two points. First, the context of the
search - what documents and text you have open or have recently
modified - could help immensely, and since this is search done on a
local computer that information could be accessible. Second, it points
out that a text-based keyword search may not be the whole answer. A
content-based information retrieval system that allows you to
construct search queries based on the kind of content you're searching
for could be an important area for research. For example, rather than
just searching for a company name in your email to find
correspondence with members of that company, a search tool could
notice that all email from that company shares the same domain
name. It might then rank email to and from a specific person as most
relevant, and email to and from that company as also
relevant.
statements, interesting possibilities open up.
When the domain switches from email to media - like music or
images - the possibilities for content-based image retrieval seem even
more interesting. Especially considering the relatively impoverished
state of metadata, text-based searching for media content on the
desktop is extremely difficult.
Existing System
Through the creation of content with applications, the downloading
of content from the Internet, or the receipt of content via email, a
file system can become quite full of important content located
throughout the system. Whether these files are carefully filed
away in deeply nested hierarchical folders, or haphazardly filed away
in a nearly flat system, at some point that data probably needs to be
accessed again. It is at this point the problem of desktop search
becomes apparent. In a system consisting of gigabytes and gigabytes
of thousands or even millions of files, how does one locate a specific
file? If it is filed away "properly," that is, in a manner the user was
conscious of and remembers, perhaps it will be easily located in that
folder. But what if the user has put the file in a folder he can't
remember? Or software automatically saved it somewhere he does not
expect? Or the folder it is in contains over a hundred files, and the
user can't remember the file's name? Or he knows the folder it is in,
but can't remember where the folder is? There are many reasons to
not be able to instantly remember the folder location of a file,
especially if it was created months or even years earlier.
Disadvantages
Speed is a major issue. By default, neither file metadata nor content is
indexed in such a way that results are returned quickly. Although
Windows XP includes something called "Indexing Service" that will
index files for quick access, it is not enabled by default. It was not
examined for the purposes of this paper since it is so seldom used or
mentioned by normal users. There is also no meaningful ranking of the
results. Although you can re-sort the results by the common file-system
metadata (name, folder location, file type, and date modified), results
appear to be returned simply in the order they are found, as Windows XP
Search linearly searches through files and folders.
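The speed problem comes from that linear scan: without an index, every query must read every file. The idea behind an indexing service can be illustrated with a tiny inverted index. This is a sketch of the general technique, not of Windows' actual Indexing Service; the class name and structure are invented here.

```java
import java.util.*;

// Illustrative inverted index: each word maps to the set of files that
// contain it, so a query is a single map lookup instead of a linear
// scan of every file on disk.
public class TinyIndex {
    private final Map index = new HashMap();   // word -> Set of file names

    // Tokenize a file's text and record each word under the file name.
    public void addFile(String fileName, String content) {
        StringTokenizer t = new StringTokenizer(content.toLowerCase());
        while (t.hasMoreTokens()) {
            String word = t.nextToken();
            Set files = (Set) index.get(word);
            if (files == null) {
                files = new TreeSet();
                index.put(word, files);
            }
            files.add(fileName);
        }
    }

    // Returns the files containing the word; empty set if none.
    public Set query(String word) {
        Set files = (Set) index.get(word.toLowerCase());
        return files == null ? Collections.EMPTY_SET : files;
    }
}
```

The trade-off is the one the section describes: the index must be built (and kept up to date) ahead of time, which is why an indexing service left disabled by default gives no benefit.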
Proposed System
The proposed system is a Java application that finds files which
contain (or do not contain) a given string. The string may be plain
text or a Java regular expression. Such a trivial search should be
part of the operating
system, and in fact, once was. As bigger and more impressive
features were added to Windows, it lost the ability to search files for
arbitrary bytes of text. Windows 98/ME/2000 could find words buried
in files with unknown formats; Windows XP and Vista search only their
supported file types. A regular expression is a way of specifying
relationships between elements of a complex pattern. You don’t need
to understand regular expressions to use this program. This program
can be executed from both the command prompt and the graphical
user interface. By implementing regular-expression search, we can
overcome the disadvantages of the previous system.
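The regular-expression mode can be sketched with the `java.util.regex` package, which has been part of the platform since Java 1.4. The class name and sample patterns below are illustrative, not taken from the report's code.

```java
import java.util.regex.*;

// Sketch of regular-expression matching against a file's text:
// the query is compiled once into a Pattern, then applied to content.
public class RegexSearchDemo {

    // Returns true if the regex occurs anywhere in the content.
    // caseSensitive=false corresponds to an "ignore case" search option.
    static boolean contains(String content, String regex, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Pattern p = Pattern.compile(regex, flags);
        return p.matcher(content).find();   // find() matches a substring
    }
}
```

A plain-text query is just the degenerate case where the pattern contains no regex metacharacters, which is why one code path can serve both search modes.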
System Specifications
Hardware Specification:
The speed of this program depends upon the speed of your
computer’s hardware. When searching for plain ASCII text or Unicode
characters from 0x20 to 0x7E, the “(raw data bytes)” encoding is
about 40% faster than the local system’s “(default encoding)”. Even
an old Intel Pentium 3 processor at 3.0 GHz should be able to scan
large files at 15 megabytes per second (MB/s) as raw data bytes with
the “case” option enabled.
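The two decoding modes benchmarked above can be sketched as follows. This is an assumption about what the quoted option names mean, based on standard Java behavior: "(raw data bytes)" treats every byte as one character (ISO-8859-1 does exactly this), while "(default encoding)" runs the platform's possibly multi-byte charset decoder, and that extra decoding work is a plausible source of the speed difference.

```java
import java.io.*;

// The two text-decoding modes: a fixed one-byte-per-char mapping
// versus the platform's default (possibly multi-byte) charset.
public class EncodingModes {

    // "(raw data bytes)": ISO-8859-1 maps byte 0xNN to char U+00NN,
    // so no lookup tables or multi-byte sequences are involved.
    static String asRawBytes(byte[] data) throws IOException {
        return new String(data, "ISO-8859-1");
    }

    // "(default encoding)": uses the local system's default charset,
    // which may decode several bytes into one character.
    static String asDefaultEncoding(byte[] data) {
        return new String(data);
    }
}
```

For plain ASCII (bytes 0x20 to 0x7E) both modes produce identical strings, which is why the raw mode can be used safely for such searches while being faster.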
PROCESSOR   : Pentium Series
RAM         : 64 MB
KEYBOARD    : 104 Keys
FLOPPY DISK : 1.44 MB
HARD DISK   : 6 GB
MOUSE       : Serial Mouse
Software Specification:
FileSearch was developed with Java 1.4 and should run on later
versions. It may also run on earlier versions, but this has not been
tested. For Macintosh computers, the version of Java is determined by
your version of MacOS. For Windows, Linux, and Solaris, you can
download the JRE from Sun Microsystems:
Sun Java
JRE for end users: http://www.java.com/getjava/
SDK for programmers: http://developers.sun.com/downloads/
IDE for programmers: http://www.netbeans.org/
As the application is developed using Java technology, compiling the
project requires the Java SDK installed on the system, but running it
requires only a JVM. Nowadays most operating systems ship with a JVM
built in. If no JVM is found on the system, it can be downloaded free
of charge from the Sun Microsystems website. Once the JVM is installed,
the application can be run. As per this project development and executing