A Project Report On
COMPRESSION & DECOMPRESSION
Submitted in partial fulfillment of the requirement for the award of the degree of
Bachelor of Technology
in
Information Technology
By
RAHUL SINGH SHAKUN GARG
0407713057, 0407713042
Dr. K.N.MODI INSTITUTE OF ENGINEERING & TECHNOLOGY
Approved by A.I.C.T.E. Affiliated to U. P. Technical University, Lucknow
Modinagar – 201204 (Batch: 2004-2008)
CONTENTS
ACKNOWLEDGEMENT 4
CERTIFICATE 5
LIST OF TABLES 6
LIST OF FIGURES 6
ABSTRACT 7-13
SYNOPSIS OF THE PROJECT 14-18
1 OBJECTIVE 14
2 SCOPE 14
3 DESIGN PRINCIPLE & EXPLANATION 16-17
3.1 Module Description
3.1.1 Huffman Zip
3.1.2 Encoder
3.1.3 Decoder
3.1.4 Table
3.1.5 DLNode
3.1.6 Priority Queue
3.1.7 Huffman Node
4 HARDWARE & SOFTWARE REQUIREMENTS 18
MAIN REPORT 20-118
1 Objective & Scope of the Project 20
2 Theoretical Background 23
2.1 Introduction 23
2.2 Theory 23
2.3 Definition 24-35
2.3.1 Lossless vs Lossy Compression
2.3.2 Image Compression
2.3.3 Video Compression
2.3.4 Text Compression
2.3.5 LZW Algorithm
3 Problem Statement 36
4 System analysis and design 39
4.1 Analysis 39
4.2 Design 40-48
4.2.1 System design
4.2.2 Design objective
4.2.3 Design principle
5 Stages in System Life Cycle 49
5.1 Requirement Determination 49
5.2 Requirement Specifications 49
5.3 Feasibility Analysis 50
5.4 Final Specification 50
5.5 Hardware Study 51
5.6 System Design 51
5.7 System Implementation 52
5.8 System Evaluation 52
5.9 System Modification 52
5.10 System Planning 53
6 Hardware & Software Requirement 60
7 Project Description 61
7.1 Huffman Algorithm 61
7.2 Code Construction 68
7.3 Huffing Program 68
7.4 Building Table 69
7.5 Decompressing 70
7.6 Transmission & storage of Huffman encoded data 72
8 Working of Project 73
8.1 Module & their description 73
9 Data Flow Diagram 75
10 Print Layouts 82
11 Implementation 85
12 Testing 87
12.1 Test plan 87
12.2 Terms in testing fundamentals 88
13 Conclusion 94
14 Future Enhancement & New Direction 95
14.1 New Direction 95
14.2 Scope of future work 96
14.3 Scope of future application 96
15 Source Code 97-118
16 References 119
ACKNOWLEDGEMENT
Keep away from people who try to belittle your ambitions. Small people always do that,
but the really great make you feel that you, too, can become great.
We take this opportunity to express our sincere thanks and deep gratitude to all
those people who extended their wholehearted co-operation and helped us in
completing this project successfully.
First of all, we would like to thank Mr. Gaurav Vajpai (Project Guide) for his
strict supervision, constant encouragement, inspiration and guidance, which ensured the
worthiness of our work. Working under him was an enriching experience. His inspiring
suggestions and timely guidance enabled us to perceive the various aspects of the project in a
new light.
We would also like to thank Prof. Jaideep Kumar, Head of the Department of IT, who
guided us a great deal in completing this project. We would also like to thank our parents and
project mates for guiding and encouraging us throughout the duration of the project.
We would be failing in our mission if we did not thank the other people who directly or
indirectly helped us in the successful completion of this project. So, our heartfelt thanks to
all the teaching and non-teaching staff of the computer science and engineering department of
our institution for their valuable guidance throughout the working of this project.
RAHUL SINGH
SHAKUN GARG
MANISH SRIVASTAVA
Dr. K.N. Modi Institute of Engineering and Technology Modinagar
Affiliated to UP Technical University, Lucknow
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that RAHUL SINGH (0407713057), SHAKUN GARG (0407713042) and
MANISH SRIVASTAVA (0407713021) of the final year B. Tech. (IT) have carried out
project work on “COMPRESSION & DECOMPRESSION” under the guidance of
Mr. GAURAV VAJPAI in the Department of IT, in partial fulfillment of the requirement for the
award of the degree of Bachelor of Technology in Information Technology at Dr. K.N. Modi
Institute of Engineering & Technology, Modinagar (Affiliated to U.P. Technical University,
Lucknow). This is a bonafide record of work done by them during the year 2007 – 2008.
Head of the Department: Internal Guide:
(Mr. JAIDEEP KUMAR) Mr. GAURAV VAJPAI
Head, Department of IT
LIST OF TABLES
Table No. Table Name Page No.
1 FILE TABLE
2 DETAIL TABLE
LIST OF FIGURES
Figure No. Figure Name Page No.
1. ARCHITECTURE OF NETPOD 19
2 PERT CHART 36
3. GANTT CHART 38
COMPRESSION & DECOMPRESSION
STATEMENT ABOUT THE PROBLEM
In today’s world of computing, it is hardly possible to do without graphics, images
and sound. Just by looking at the applications around us, the Internet, Video
CDs (Compact Discs), video conferencing, and much more, all these applications use
graphics and sound intensively.
Many of us have surfed the Internet: have you ever become so frustrated waiting
for a graphics-intensive web page to open that you stopped the transfer? We bet you
have. Imagine what would happen if those graphics were not compressed.
Uncompressed graphics, audio, and video data consume very large amounts of physical
storage, which, in the case of uncompressed video, even present CD technology is unable to
handle.
WHY IS THE PARTICULAR TOPIC CHOSEN?
Files available for transfer from one host to another over a network (or via modem) are often
stored in a compressed format or some other special format well-suited to the storage medium
and/or transfer method. There are many reasons for compressing/archiving files. The more
common are:
File compression can significantly reduce the size of a file (or group of files). Smaller files
take up less storage space on the host and less time to transfer over the network, saving both
time and money.
OBJECTIVE AND SCOPE OF THE PROJECT
The objective of this system is to compress and decompress files. This system will be used to
compress files so that they take less memory for storage and for transmission from one
computer to another. This system will work in the following ways:
To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.
Our project will be able to compress a message into a form that can be easily transmitted
over the network or from one system to another. At the receiving end, after decompressing the
message, the receiver will get the original message. In this way, effective transmission of data
takes place between sender and receiver.
Reusability:
Reusability is possible as and when required in this application, and it can be updated in the
next version. Reusable software reduces design, coding and testing cost by amortizing effort over
several designs. Reducing the amount of code also simplifies understanding, which increases
the likelihood that the code is correct. We follow both types of reusability: sharing of
newly written code within a project, and reuse of previously written code on new projects.
Extensibility:
This software can be extended in ways that its original developers may not expect. The following
principles enhance extensibility: hide data structures, avoid traversing multiple links or
methods, avoid case statements on object type, and distinguish public and private operations.
Robustness:
A method is robust if it does not fail even if it receives improper parameters. Useful practices
include protecting against errors, optimizing only after the program runs, validating arguments
and avoiding predefined limits.
Understandability:
A method is understandable if someone other than its creator can understand
the code (as well as the creator, after a time lapse). Keeping methods small and
coherent helps to accomplish this.
Cost-effectiveness:
The system's cost is within budget, and it is built within the given time period. It is desirable
to aim for a system with minimum cost subject to the condition that it must satisfy all the
requirements.
The scope of this document is to put down the requirements, clearly identifying the
information needed by the user, the source of the information, and the outputs expected
from the system.
METHODOLOGY ADOPTED
The methodology used is the classic life-cycle model: the “waterfall model”.
HARDWARE & SOFTWARE REQUIREMENTS
HARDWARE SPECIFICATIONS:
Processor: Pentium I/II/III or higher
RAM: 128 MB or higher
Monitor: 15-inch (digital) with 800 x 600 support
Keyboard: 101-key keyboard
Mouse: 2-button serial / PS/2
Tools / Platform / Language Used:
Language: Java
OS: Any OS such as Windows 98/NT/XP/Vista
TESTING TECHNOLOGIES
Some of the commonly used strategies for testing are as follows:
Unit testing
Module testing
Integration testing
System testing
Acceptance testing
UNIT TESTING
Unit testing is the testing of a single program module in an isolated environment. Testing of
the processing procedures is the main focus.
MODULE TESTING
A module encapsulates related components, so it can be tested without other system modules.
INTEGRATION TESTING
Integration testing is the testing of the interfaces among the system modules. In other words, it
ensures that the modules interact as intended.
SYSTEM TESTING
System testing is the testing of the system against its initial objectives. It is done either in a
simulated environment or in a live environment.
ACCEPTANCE TESTING
Acceptance testing is performed with realistic data of the client to demonstrate that the
software is working satisfactorily. Testing here is focused on the external behavior of the
system; the internal logic of the program is not emphasized.
WHAT CONTRIBUTION WOULD THE PROJECT MAKE?
The contributions of COMPRESSION & DECOMPRESSION are as follows:
Compression is useful because it helps reduce the consumption of expensive
resources, such as hard disk space or transmission bandwidth.
It involves trade-offs between various factors, including the degree of compression,
the amount of distortion introduced (if using a lossy compression scheme), and the
computational resources required to compress and uncompress the data.
SYNOPSIS OF THE PROJECT
1. OBJECTIVE
The objective of this system is to compress and decompress files. This system will be used to
compress files so that they take less memory for storage and for transmission from one
computer to another. This system will work in the following ways:
To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.
2. SCOPE
Our project will be able to compress a message into a form that can be easily transmitted
over the network or from one system to another. At the receiving end, after decompressing the
message, the receiver will get the original message. In this way, effective transmission of data
takes place between sender and receiver.
Reusability:
Reusability is possible as and when required in this application, and it can be updated in the
next version. Reusable software reduces design, coding and testing cost by amortizing effort over
several designs. Reducing the amount of code also simplifies understanding, which increases
the likelihood that the code is correct. We follow both types of reusability: sharing of
newly written code within a project, and reuse of previously written code on new projects.
Extensibility:
This software can be extended in ways that its original developers may not expect. The following
principles enhance extensibility: hide data structures, avoid traversing multiple links or
methods, avoid case statements on object type, and distinguish public and private operations.
Robustness:
A method is robust if it does not fail even if it receives improper parameters. Useful practices
include protecting against errors, optimizing only after the program runs, validating arguments
and avoiding predefined limits.
Understandability:
A method is understandable if someone other than its creator can understand
the code (as well as the creator, after a time lapse). Keeping methods small and
coherent helps to accomplish this.
Cost-effectiveness:
The system's cost is within budget, and it is built within the given time period. It is desirable
to aim for a system with minimum cost subject to the condition that it must satisfy all the
requirements.
The scope of this document is to put down the requirements, clearly identifying the
information needed by the user, the source of the information, and the outputs expected
from the system.
3. DESIGN PRINCIPLES & EXPLANATION
MODULE DESCRIPTION
The project comprises the following modules:
Huffman Zip
Encoder
Decoder
Table
DLNode
Priority Queue
Huffman Node
Huffman Zip is the main module, which uses an applet. It provides the user interface.
Encoder is the module for compressing the file. It implements the Huffman algorithm for
compressing text and image files. It first calculates the frequencies of all the occurring
symbols. On the basis of these frequencies it then generates a priority queue, which is used
for finding the symbols with the least frequencies. The two symbols with the lowest
frequencies are deleted from the queue, and a new symbol is added to the queue with
frequency equal to the sum of the frequencies of the two deleted symbols. At the same time
we grow a tree in which the leaf nodes are the two deleted nodes and the root is the new node
added to the queue. Finally we traverse the tree from the root node to each leaf node,
assigning 0 to the left child and 1 to the right child. In this way we assign a code to every
symbol in the file. These codes are binary; we group them, calculate the equivalent integers,
and store them in the output file, which is the compressed file.
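The encoding steps described above can be sketched in Java as follows. This is a minimal illustration of the technique, not the project's actual source; the class and method names are our own.

```java
import java.util.*;

public class HuffmanSketch {
    // A node of the Huffman tree: a leaf holds a symbol, an internal node holds null.
    static class Node implements Comparable<Node> {
        final int freq;
        final Character symbol;
        final Node left, right;
        Node(int freq, Character symbol, Node left, Node right) {
            this.freq = freq; this.symbol = symbol; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build the tree: repeatedly delete the two lowest-frequency nodes from the
    // priority queue and add a new parent whose frequency is their sum.
    static Node buildTree(Map<Character, Integer> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));
        }
        return pq.poll();
    }

    // Traverse from the root to each leaf, assigning 0 to left edges and 1 to right edges.
    static void buildCodes(Node n, String prefix, Map<Character, String> table) {
        if (n == null) return;
        if (n.symbol != null) { table.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        buildCodes(n.left, prefix + "0", table);
        buildCodes(n.right, prefix + "1", table);
    }
}
```

Note that the more frequent a symbol is, the later its node is merged, so the shorter its code: this is what produces the compression.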
Decoder works in the reverse order to the encoder. It reads the input from the compressed file
and converts it into the equivalent binary code. It takes as a second input the binary tree
generated in the encoding process, and on the basis of these data it regenerates the original
file. This project is based on lossless compression.
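Because a Huffman code is a prefix code, the decoder can also be sketched without the tree, by matching bits greedily against a codeword table; the first complete match is always the right one. This sketch is illustrative only and uses names of our own choosing.

```java
import java.util.*;

public class HuffmanDecodeSketch {
    // Decode a string of '0'/'1' bits using a prefix-code table (codeword -> symbol).
    // No codeword is a prefix of another, so the first match is unambiguous.
    static String decode(String bits, Map<String, Character> codeToSymbol) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character sym = codeToSymbol.get(current.toString());
            if (sym != null) { out.append(sym); current.setLength(0); }
        }
        if (current.length() != 0)
            throw new IllegalArgumentException("trailing bits do not form a codeword");
        return out.toString();
    }
}
```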
Table is used for storing the code of each symbol. Priority Queue takes as input the symbols
and their related frequencies, and on the basis of these frequencies it assigns a priority to each
symbol. Huffman Node is used for creating the binary tree. It takes two symbols from the
priority queue and creates two nodes by comparing the frequencies of these two symbols. It
places the symbol with the lower frequency on the left and the symbol with the higher
frequency on the right; it then deletes these two symbols from the priority queue and places in
it a new symbol with frequency equal to the sum of the frequencies of the two deleted
symbols. It also generates a parent node for the two nodes and assigns it a frequency equal to
the sum of the frequencies of the two leaf nodes.
4. HARDWARE & SOFTWARE REQUIREMENTS
Existing hardware will be used:
Intel Pentium-IV
128 MB RAM
SVGA Color Monitor on PCI with 1MB RAM
101 Keys Keyboard
1 Microsoft Mouse with pad
Tools / Platform Language Used:
Language: Java
OS: Any OS such as Windows XP/98/NT,
Database: MS Access.
MAIN REPORT
OBJECTIVE AND SCOPE
The objective of this system is to compress and decompress files. This system will be used to
compress files so that they take less memory for storage and for transmission from one
computer to another. This system will work in the following ways:
To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.
SCOPE
Our project will be able to compress a message into a form that can be easily transmitted
over the network or from one system to another. At the receiving end, after decompressing the
message, the receiver will get the original message. In this way, effective transmission of data
takes place between sender and receiver.
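The compression ratio mentioned in the objectives can be reported as the percentage of space saved, computed from the original and compressed file sizes. A minimal sketch (the class and method names are illustrative, not from the project's source):

```java
public class RatioSketch {
    // Compression ratio as a percentage of space saved:
    // e.g. 1000 bytes compressed to 400 bytes means 60% saved.
    static double percentSaved(long originalBytes, long compressedBytes) {
        if (originalBytes <= 0) throw new IllegalArgumentException("empty input");
        return 100.0 * (originalBytes - compressedBytes) / originalBytes;
    }
}
```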
Reusability:
Reusability is possible as and when required in this application, and it can be updated in the
next version. Reusable software reduces design, coding and testing cost by amortizing effort over
several designs. Reducing the amount of code also simplifies understanding, which increases
the likelihood that the code is correct. We follow both types of reusability: sharing of
newly written code within a project, and reuse of previously written code on new projects.
Extensibility:
This software can be extended in ways that its original developers may not expect. The following
principles enhance extensibility: hide data structures, avoid traversing multiple links or
methods, avoid case statements on object type, and distinguish public and private operations.
Robustness:
A method is robust if it does not fail even if it receives improper parameters. Useful practices
include protecting against errors, optimizing only after the program runs, validating arguments
and avoiding predefined limits.
Understandability:
A method is understandable if someone other than its creator can understand
the code (as well as the creator, after a time lapse). Keeping methods small and
coherent helps to accomplish this.
Cost-effectiveness:
The system's cost is within budget, and it is built within the given time period. It is desirable
to aim for a system with minimum cost subject to the condition that it must satisfy all the
requirements.
The scope of this document is to put down the requirements, clearly identifying the
information needed by the user, the source of the information, and the outputs expected
from the system.
THEORETICAL BACKGROUND
Introduction
A brief introduction to information theory is provided in this section. The definitions and
assumptions necessary for a comprehensive discussion and evaluation of data compression
methods are discussed. The following string of characters is used to illustrate the concepts
defined: EXAMPLE = aa bbb cccc ddddd eeeeee fffffff gggggggg.
Theory:
The theoretical background of compression is provided by information theory (which is
closely related to algorithmic information theory) and by rate-distortion theory. These fields
of study were essentially created by Claude Shannon, who published fundamental papers on
the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data
compression "has one of the simplest and most elegant design theories in all of engineering".
Cryptography and coding theory are also closely related. The idea of data compression is
deeply connected with statistical inference.
Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy
data compression systems typically include even more stages, including, for example,
prediction, frequency transformation, and quantization.
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for
lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed
and compression ratio, although compression can be slow. LZW (Lempel-Ziv-Welch) is used
in GIF images. LZ methods utilize a table-based compression model where table entries are
substituted for repeated strings of data. For most LZ methods, this table is generated
dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g.
SHRI, LZX).
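The table-substitution idea behind LZW can be sketched as follows. This is a minimal, illustrative encoder that grows its dictionary dynamically from earlier input, as described above; a real implementation (e.g. in GIF) additionally packs the output codes into variable-width bit fields.

```java
import java.util.*;

public class LzwSketch {
    // Encode the input as a list of dictionary codes. The dictionary starts with
    // all single bytes (codes 0-255) and grows as repeated strings are seen.
    static List<Integer> encode(String input) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        int nextCode = 256;
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : input.toCharArray()) {
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                       // extend the current match
            } else {
                out.add(dict.get(w));         // emit code for the longest match
                dict.put(wc, nextCode++);     // add the new string to the table
                w = "" + c;
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }
}
```

Repeated substrings like "AB" and "ABA" are emitted as single codes the second time they occur, which is where the compression comes from.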
The very best compressors use probabilistic models whose predictions are coupled to an
algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and
turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to
the better-known Huffman algorithm, and lends itself especially well to adaptive data
compression tasks where the predictions are strongly context-dependent.
Definition :
In computer science and information theory, data compression or source coding is the process
of encoding information using fewer bits (or other information-bearing units) than an
unencoded representation would use through use of specific encoding schemes. For example,
this article could be encoded with fewer bits if one were to accept the convention that the
word "compression" be encoded as "comp". One popular instance of compression with which
many computer users are familiar is the ZIP file format, which, as well as providing
compression, acts as an archiver, storing many files in a single output file.
As is the case with any form of communication, compressed data communication only works
when both the sender and receiver of the information understand the encoding scheme. For
example, this text makes sense only if the receiver understands that it is intended to be
interpreted as characters representing the English language. Similarly, compressed data can
only be understood if the decoding method is known by the receiver.
Compression is useful because it helps reduce the consumption of expensive resources, such
as hard disk space or transmission bandwidth. On the downside, compressed data must be
decompressed to be viewed (or heard), and this extra processing may be detrimental to some
applications. For instance, a compression scheme for video may require expensive hardware
for the video to be decompressed fast enough to be viewed as it's being decompressed (you
always have the option of decompressing the video in full before you watch it, but this is
inconvenient and requires storage space to put the uncompressed video). The design of data
compression schemes therefore involves trade-offs between various factors, including the
degree of compression, the amount of distortion introduced (if using a lossy compression
scheme), and the computational resources required to compress and uncompress the data.
A code is a mapping of source messages (words from the source alphabet alpha) into
codewords (words of the code alphabet beta). The source messages are the basic units into
which the string to be represented is partitioned. These basic units may be single symbols
from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = {
a, b, c, d, e, f, g, space}. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes
can be categorized as block-block, block-variable, variable-block or variable-variable, where
block-block indicates that the source messages and codewords are of fixed length and
variable-variable codes map variable-length source messages into variable-length codewords.
A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is
given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length
of the coded message would be 120; using Figure 1.2 the length would be 30.
source message   codeword      source message   codeword
a                000           aa               0
b                001           bbb              1
c                010           cccc             10
d                011           ddddd            11
e                100           eeeeee           100
f                101           fffffff          101
g                110           gggggggg         110
space            111           space            111
(Figure 1.1: block-block code)  (Figure 1.2: variable-variable code)
The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block
codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords.
These are not discussed, as they do not provide compression. The codes featured in this
survey are of the block-variable, variable-variable, and variable-block types.
When source messages of variable length are allowed, the question of how a message
ensemble (sequence of messages) is parsed into individual messages arises. Many of the
algorithms described here are defined-word schemes. That is, the set of source messages is
determined prior to the invocation of the coding scheme. For example, in text file processing
each character may constitute a message, or messages may be defined to consist of
alphanumeric and non-alphanumeric strings.
In Pascal source code, each token may represent a message. All codes involving fixed-length
source messages are, by default, defined-word codes. In free-parse methods, the coding
algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the
known data compression methods are defined-word schemes; the free-parse model differs in
a fundamental way from the classical coding paradigm.
A code is distinct if each codeword is distinguishable from every other (i.e., the mapping
from source messages to codewords is one-to-one). A distinct code is uniquely decodable if
every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of
these features is desirable. The codes of Figure 1.1 and Figure 1.2 are both distinct, but the
code of Figure 1.2 is not uniquely decodable. For example, the coded message 11 could be
decoded as either ddddd or bbbbbb. A uniquely decodable code is a prefix code (or prefix-
free code) if it has the prefix property, which requires that no codeword is a proper prefix of
any other codeword. All uniquely decodable block-block and variable-block codes are prefix
codes. The code with codewords { 1, 100000, 00 } is an example of a code which is uniquely
decodable but which does not have the prefix property. Prefix codes are instantaneously
decodable; that is, they have the desirable property that the coded message can be parsed into
codewords without the need for lookahead. In order to decode a message encoded using the
codeword set { 1, 100000, 00 }, lookahead is required. For example, the first codeword of the
message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the
message is read (if the string of zeros had been of odd length, then the first codeword would
have been 100000).
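The prefix property discussed above is easy to check mechanically: a codeword set fails it exactly when one codeword is a proper prefix of another. A small illustrative sketch (the class name is our own):

```java
import java.util.*;

public class PrefixCheck {
    // A set of codewords is a prefix code iff no codeword is a proper prefix of another.
    static boolean isPrefixFree(Set<String> codewords) {
        for (String a : codewords)
            for (String b : codewords)
                if (!a.equals(b) && b.startsWith(a)) return false;
        return true;
    }
}
```

Applied to the example above, { 1, 100000, 00 } fails the check (1 is a proper prefix of 100000), while the block-block codewords of Figure 1.1 pass it.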
A minimal prefix code is a prefix code such that if x is a proper prefix of some codeword,
then x sigma is either a codeword or a proper prefix of a codeword, for each letter sigma in
beta. The set of codewords { 00, 01, 10 } is an example of a prefix code which is not
minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a
codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality
constraint prevents the use of codewords which are longer than necessary. In the above
example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix
code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.
In this section, a code has been defined to be a mapping from a source alphabet to a code
alphabet; we now define related terms. The process of transforming a source ensemble into a
coded message is coding or encoding. The encoded message may be referred to as an
encoding of the source ensemble. The algorithm which constructs the mapping and uses it to
transform the source ensemble is called the encoder. The decoder performs the inverse
operation, restoring the coded message to its original form.
Lossless vs. lossy compression:
Lossless compression algorithms usually exploit statistical redundancy in such a way as to
represent the sender's data more concisely, but nevertheless perfectly. Lossless compression
is possible because most real-world data has statistical redundancy. For example, in English
text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q'
will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression, is possible if some loss of
fidelity is acceptable. For example, a person viewing a picture or television video scene might
not notice if some of its finest details are removed or not represented perfectly (i.e. may not
even notice compression artifacts). Similarly, two clips of audio may be perceived as the
same to a listener even though one is missing details found in the other. Lossy data
compression algorithms introduce relatively minor differences and represent the picture,
video, or audio using fewer bits.
Lossless compression schemes are reversible so that the original data can be reconstructed,
while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files;
indeed, any compression algorithm will necessarily fail to compress any data containing no
discernible patterns. Attempts to compress data that has been compressed already will
therefore usually result in an expansion, as will attempts to compress encrypted data.
In practice, lossy data compression will also come to a point where compressing again does
not work, although an extremely lossy algorithm, which for example always removes the last
byte of a file, will always compress a file up to the point where it is empty.
A good example of lossless vs. lossy compression is the following string: 888883333333.
That is the string written in an uncompressed form. However, you could save space by writing
it as 8[5]3[7]. By saying "5 eights, 7 threes", you still have the original string,
just written in a smaller form. In a lossy system, using 83 instead, you cannot get the original
data back (with the benefit of a smaller file size).
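The lossless 8[5]3[7] scheme above is run-length encoding, and it can be sketched in a few lines. This is an illustration of the notation used in the example, not part of the project's source:

```java
public class RleSketch {
    // Lossless run-length encoding in the style above: "888883333333" -> "8[5]3[7]".
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;  // end of run
            out.append(s.charAt(i)).append('[').append(j - i).append(']');
            i = j;
        }
        return out.toString();
    }
}
```

The encoding is reversible, which is what makes it lossless: each symbol and its count are preserved exactly.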
A small overview of the different kinds of compression is presented below:
Image compression:
Image here refers not only to still images but also to motion pictures; compression is the
process used to reduce the physical size of a block of information.
Compression is simply representing information more efficiently; "squeezing the air" out of
the data, so to speak. It takes advantage of three common qualities of graphical data: it is
often redundant, predictable or unnecessary.
Today, compression has made a great impact on the storage of large volumes of image data.
Even hardware and software for compression and decompression are increasingly being made
part of a computer platform. Compression does have its trade-offs. The more efficient the
compression technique, the more complicated the algorithm will be, and thus the more
computational resources or time it requires to decompress. This tends to affect speed. Speed is
not so important for still images but matters a great deal for motion pictures. Surely you
do not want to see your favourite movies appearing frame by frame in front of you.
Most methods for irreversible, or "lossy", digital image compression consist of three main
steps: transform, quantization and coding, as illustrated in the figure.
The three steps of digital image compression.
Image compression is the application of Data compression on digital images. In effect, the
objective is to reduce redundancy of the image data in order to be able to store or transmit
data in an efficient form.
Image compression can be lossy or lossless. Lossless compression is sometimes preferred for
artificial images such as technical drawings, icons or comics. This is because lossy
compression methods, especially when used at low bit rates, introduce compression artifacts.
Lossless compression methods may also be preferred for high value content, such as medical
imagery or image scans made for archival purposes. Lossy methods are especially suitable for
natural images such as photos in applications where minor (sometimes imperceptible) loss of
fidelity is acceptable to achieve a substantial reduction in bit rate.
The best image quality at a given bit-rate (or compression rate) is the main goal of image
compression. However, there are other important properties of image compression schemes:
Scalability generally refers to a quality reduction achieved by manipulation of the bitstream
or file (without decompression and re-compression). Other names for scalability are
progressive coding or embedded bitstreams. Despite its contrary nature, scalability can also
be found in lossless codecs, usually in the form of coarse-to-fine pixel scans. Scalability is
especially useful for previewing images while downloading them (e.g. in a web browser) or
for providing variable quality access to e.g. databases. There are several types of scalability,
such as quality-progressive, resolution-progressive and component-progressive coding.
Region of interest coding: Certain parts of the image are encoded with higher quality than
others. This can be combined with scalability (encode these parts first, others later).
Meta information: Compressed data can contain information about the image which can be
used to categorize, search or browse images. Such information can include color and texture
statistics, small preview images and author/copyright information.
The quality of a compression method is often measured by the Peak signal-to-noise ratio. It
measures the amount of noise introduced through a lossy compression of the image.
However, the subjective judgement of the viewer is also regarded as an important, perhaps
the most important measure.
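As an illustration (not from the report), PSNR can be computed from the mean squared error between the original and reconstructed images; for 8-bit samples the peak value is 255. The class and array values below are ad hoc.

```java
// Sketch: PSNR between an original and a lossily compressed 8-bit grayscale
// image, both given as arrays of pixel values. Assumes the images differ
// (MSE > 0) and have the same length.
public class Psnr {
    // Returns PSNR in decibels; higher means less noise was introduced.
    static double psnr(int[] original, int[] compressed) {
        double mse = 0.0;
        for (int i = 0; i < original.length; i++) {
            double d = original[i] - compressed[i];
            mse += d * d;
        }
        mse /= original.length;
        double max = 255.0;                      // peak value for 8-bit samples
        return 10.0 * Math.log10(max * max / mse);
    }

    public static void main(String[] args) {
        int[] a = {52, 55, 61, 66};
        int[] b = {52, 54, 61, 67};              // small distortion
        System.out.printf("PSNR = %.2f dB%n", psnr(a, b));
    }
}
```

A heavily distorted image yields a low PSNR; identical images would give an infinite one, which is why the subjective judgement mentioned above remains important.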
Video Compression:
A raw video stream tends to be quite demanding in terms of storage requirements, and of
network capacity when being transferred between computers. Before being stored or
transferred, the raw stream is usually transformed to a compressed representation.
When compressing an image sequence, one may consider the sequence a series of
independent images, and compress each frame using single image compression methods, or
one may use specialized video sequence compression schemes, taking advantage of
similarities in nearby frames. The latter will generally compress better, but may complicate
handling of variations in network transfer speed.
Compression algorithms may be classified into two main groups, reversible and irreversible.
If the result of compression followed by decompression gives a bitwise exact copy of the
original for every compressed image, the method is reversible. This implies that no
quantizing is done, and that the transform is accurately invertible, i.e. it does not introduce
round-off errors.
When compressing general data, like an executable program file or an accounting database, it
is extremely important that the data can be reconstructed exactly. For images and sound, it is
often convenient, or even necessary to allow a certain degradation, as long as it is not too
noticeable by an observer.
Text compression:
The following methods yield two basic data compression algorithms, which produce good
compression ratios and run in linear time.
The first strategy is a statistical encoding that takes into account the frequencies of symbols
to build a uniquely decipherable code that is optimal with respect to the compression
criterion. Huffman's method (1952) provides such an optimal statistical coding. It admits a
dynamic version where symbol counting is done at coding time. The UNIX command
"compact" implements this version.
Ziv and Lempel (1977) designed a compression method based on encoding segments. These
segments are stored in a dictionary that is built during the compression process. When a
segment of the dictionary is encountered later while scanning the original text, it is substituted
by its index in the dictionary. In the model where portions of the text are replaced by pointers
to previous occurrences, Ziv and Lempel's compression scheme can be proved to be
asymptotically optimal (on large enough texts satisfying good conditions on the probability
distribution of symbols). The dictionary is the central point of the algorithm. Furthermore, a
hashing technique makes its implementation efficient. This technique, improved by Welch
(1984), is implemented by the "compress" command of the UNIX operating system.
The problems and algorithms discussed above give a sample of text processing methods.
Several other algorithms improve on their performance when, for example, memory space or
the number of processors of a parallel machine is taken into account. The methods also
extend to other discrete objects such as trees and images.
LZW ALGORITHM
Compressor algorithm:

    w = NIL;
    while (read a char c) do
        if (wc exists in dictionary) then
            w = wc;
        else
            add wc to the dictionary;
            output the code for w;
            w = c;
        endif
    done
    output the code for w;

Decompressor algorithm:

    read a char k;
    output k;
    w = k;
    while (read a char k) do
        if (index k exists in dictionary) then
            entry = dictionary entry for k;
        else if (k == currSizeDict) then
            entry = w + w[0];
        else
            signal invalid code;
        endif
        output entry;
        add w + entry[0] to the dictionary;
        w = entry;
    done
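As an illustration, the two procedures above can be written in Java roughly as follows. This is a sketch (string-based dictionary, integer codes left unpacked), not the project's implementation; a real compressor would pack the codes into bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative LZW coder following the pseudocode above (names like `w` and
// `entry` match it). Codes 0-255 are seeded with the single-byte strings.
public class Lzw {
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                        // keep extending the segment
            } else {
                out.add(dict.get(w));          // output the code for w
                dict.put(wc, dict.size());     // add the new segment
                w = "" + c;
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }

    static String decompress(List<Integer> codes) {
        Map<Integer, String> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put(i, "" + (char) i);
        String w = dict.get(codes.get(0));
        StringBuilder out = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            // k == dict.size() is the one code the encoder can emit before the
            // decoder has added it: the entry must then be w + w[0].
            String entry = dict.containsKey(k) ? dict.get(k) : w + w.charAt(0);
            out.append(entry);
            dict.put(dict.size(), w + entry.charAt(0));
            w = entry;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "TOBEORNOTTOBEORTOBEORNOT";
        System.out.println(Lzw.compress(s));
        System.out.println(Lzw.decompress(Lzw.compress(s)).equals(s)); // prints true
    }
}
```

The input "ABABABA" exercises the special case in the decompressor: the encoder emits a code for a segment the decoder has not yet entered in its dictionary.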
DEFINITION OF THE PROBLEM
Problem Statement:
In today's world of computing, it is hardly possible to do without graphics, images and sound.
Just by looking at the applications around us, the Internet, development of Video CDs
(Compact Disks), Video Conferencing, and much more, all these applications use graphics
and sound intensively.
I guess many of us have surfed the Internet; have you ever become so frustrated waiting for a
graphics-intensive web page to open that you stopped the transfer? I bet you have. Guess
what would happen if those graphics were not compressed?
Uncompressed graphics, audio and video data consume very large amounts of physical
storage, which, in the case of uncompressed video, even present CD technology is unable to
handle. Why is this so?
CASE 1
Take for instance, if we want to display TV-quality full-motion video, how much physical
storage will be required? Szuprowicz states that "TV-quality video requires 720
kilobytes per frame (kbpf) displayed at 30 frames per second (fps) to obtain a full-motion
effect, which means that one second of digitised video consumes approximately 22 MB
(megabytes) of storage. A standard CD-ROM disk with 648 MB capacity and data transfer
rate of 150 KBps could only provide a total of 30 seconds of video and would take 5 seconds
to display a single frame." Based on Szuprowicz's statement we can see that this is clearly
unacceptable.
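A quick back-of-the-envelope check confirms the quoted figures (the class and variable names below are ad hoc, and 1 MB is taken as 1000 KB):

```java
// Verifies the arithmetic behind the quotation: 720 KB/frame at 30 fps,
// a 648 MB CD-ROM, and a 150 KBps transfer rate.
public class VideoStorage {
    // One second of raw video, in megabytes.
    static double mbPerSecond(double kbPerFrame, double fps) {
        return kbPerFrame * fps / 1000.0;
    }

    public static void main(String[] args) {
        double rate = mbPerSecond(720, 30);        // ~21.6 MB/s, i.e. "approximately 22 MB"
        double secondsOnCd = 648 / rate;           // ~30 s of video on a 648 MB CD
        double secondsPerFrame = 720.0 / 150.0;    // ~4.8 s to read one frame at 150 KBps
        System.out.printf("%.1f MB/s, %.0f s, %.1f s/frame%n",
                rate, secondsOnCd, secondsPerFrame);
    }
}
```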
Transmission of uncompressed graphics, audio and video is a problem too. Expensive cables
with high bandwidth are required to achieve satisfactory result, which is not feasible for the
general market.
CASE 2
Take for example the transmission of an uncompressed audio signal over the line for one
second:
[Table omitted; based on Steinmetz and Nahrstedt (1995).]
From the table we can see that for better quality of sound transmitted over the channel, both
the bandwidth and storage requirements increase, and the size is not feasible at all.
Thus, to provide feasible and cost effective solutions, most multimedia systems use
compression techniques to handle graphics, audio and video data streams.
Therefore, in this paper I will address one specific compression standard, JPEG. At the same
time, I will also go through the basic compression techniques that serve as building blocks
for JPEG.
This paper focuses on three forms of JPEG image compression: 1) baseline lossy JPEG, 2)
progressive JPEG, and 3) motion JPEG. The algorithm, characteristics and advantages of
each will be discussed.
I hope that by the end of the paper, the reader will have gained more knowledge of JPEG and
will understand how it works, not merely know that it is another image compression
standard.
SYSTEM ANALYSIS AND DESIGN
Analysis and design refers to the process of examining a business situation with the intent of
improving it through better procedures and methods.
The two main steps of development are:
Analysis
Design
ANALYSIS:
System analysis is conducted with the following objectives in mind:
Identify the user’s need.
Evaluate the system concept for feasibility.
Perform economic and technical analysis.
Allocate functions to hardware, software, people, and other system elements.
Establish cost and schedule constraints.
Create a system definition that forms the foundation for all subsequent engineering work.
Both hardware and software expertise are required to successfully attain the objectives listed
above.
DESIGN
The most creative and challenging phase of the system life cycle is system design. The term
design describes a final system and the process by which it is developed. It refers to the
technical specifications (analogous to the engineer’s blueprints) that will be applied in
implementing the candidate system. It also includes the construction of programs and
program testing. The key question here is: How should the problem be solved? The major
steps in designing are:
The first step is to determine how the output is to be produced and in what format. Samples of
the output (and input) are also presented. Second, input data and master files (data base) have
to be designed to meet the requirements of the proposed output. The operational (processing)
phases are handled through program construction and testing, including a list of the programs
needed to meet the system’s objectives and complete documentation. Finally, details related
to justification of the system and an estimate of the impact of the candidate system on the
user and the organization are documented and evaluated by management as a step towards
implementation.
The final report prior to the implementation phase includes procedural flowcharts, record
layouts, report layouts, and workable plans for implementing the candidate system.
Information on personnel, money, h/w, facilities and their estimated cost must also be
available. At this point, projected costs must be close to actual cost of implementation.
In some firms, separate groups of programmers do the programming, whereas other firms
employ analyst-programmers who do the analysis and design as well as code the programs.
For this discussion, we assume that two separate persons carry out analysis and programming.
There are certain functions, though, that the analyst must perform while programs are being
written.
SYSTEM DESIGN:
Software design sits at the technical kernel of software engineering and is applied regardless
of the software process model that is used. Beginning once software requirements have been
analyzed and specified, software design is the first of the three technical activities (design,
code generation, and test) that are required to build and verify the software. Each activity
transforms information in a manner that ultimately results in validated computer software.
The importance of software design can be stated with a single word: quality. Design is the
place where quality is fostered in software engineering.
Design provides us with representations of software that can be assessed for quality. Design
is the only way that we can accurately translate a customer's requirements into a finished
software product or system. Software design serves as the foundation for all the software
engineering and software support steps that follow. Without design we risk building an
unstable system: one that will fail when small changes are made, one that may be difficult to
test, one whose quality cannot be assessed until late in the software process, when time is
short and many dollars have already been spent.
DESIGN OBJECTIVES:
The design phase of software development deals with transforming the customer requirements
as described in the SRS document into a form implementable using a programming language.
However, we can broadly classify various design activities into two important parts:
Preliminary (or high – level) design
Detailed design
During high level design, different modules and the control relationships among them are
identified and interfaces among these modules are defined. The outcome of high level design
is called the “Program Structure” or “Software Architecture”. The structure chart is used to
represent the control hierarchy in a high level design.
During detailed design, the data structure and the algorithms used by different modules are
designed. The outcome of the detailed design is usually known as the “Module Specification”
document.
A good design should capture all the functionality of the system correctly. It should be easily
understandable, efficient and it should be easily amenable to change that is easily
maintainable. Understandability of a design is a major factor, which is used to evaluate the
goodness of a design, since a design that is easily understandable is also easy to maintain and
change.
In order to enhance the understandability of a design, it should have the following features:
Use of consistent and meaningful names for various design components.
Use of cleanly decomposed set of modules.
Neat arrangement of modules in a hierarchy, i.e. a tree-like diagram.
Modular design is one of the fundamental principles of a good design. Decomposition of a
problem into modules facilitates taking advantage of the divide-and-conquer principle: if
different modules are almost independent of each other, then each module can be understood
separately, eventually reducing the complexity greatly.
Clean decomposition of a design problem into modules means that the modules in a software
design should display “High Cohesion and Low Coupling”.
The primary characteristics of clean decomposition are high cohesion and low coupling.
“Cohesion” is a measure of the functional strength of a module.
“Coupling” between two modules is a measure of the degree of interaction or
interdependence between them.
A module having high cohesion and low coupling is said to be “functionally independent” of
other modules. By functional independence we mean that a cohesive module performs a
single task or function.
Functionally independent module has minimal interaction with other modules. Functional
independence is a key to good design primarily due to the following reasons:
Functional independence reduces error propagation. An error existing in one module does not
directly affect other modules, and an error existing in other modules does not directly affect
this module.
Reuse of a module is possible because each module performs some well-defined and precise
function, and its interface with other modules is simple and minimal. Complexity of the
design is reduced because different modules can be understood in isolation, as modules are
more or less independent of each other.
DESIGN PRINCIPLES:
Top-Down and Bottom-Up Strategies
Modularity
Abstraction
Problem Partitioning and Hierarchy
TOP-DOWN AND BOTTOM-UP STRATEGIES:
A system consists of components, which have components of their own; indeed a system is a
hierarchy of components. The highest-level components correspond to the total system. To
design such hierarchies there are two possible approaches: top-down and bottom-up. The top-
down approach starts from the highest-level component of the hierarchy and proceeds
through to lower levels. By contrast, a bottom-up approach starts with the lowest-level
component of the hierarchy and proceeds through progressively higher levels to the top-level
component.
Top-down design methods often result in some form of “stepwise refinement.” Starting from
an abstract design, in each step the design is refined to a more concrete level, until we reach a
level where no more refinement is needed and the design can be implemented directly.
Bottom-up methods work with “layers of abstraction.” Starting from the very bottom,
operations that provide a layer of abstraction are implemented. The operations of this layer
are then used to implement more powerful operations and a still higher layer of abstraction,
until the stage is reached where the operations supported by the layer are those desired by the
system.
MODULARITY:
The real power of partitioning comes if a system is partitioned into modules so that the
modules are solvable and modifiable separately. It will be even better if the modules are also
separately compilable. A system is considered modular if it consists of discrete components
so that each component can be implemented separately, and a change to one component has
minimal impact on other components.
Modularity is clearly a desirable property in a system. Modularity helps in system debugging
(isolating a problem to a component is easier if the system is modular), in system repair
(changing a part of the system is easy, as it affects few other parts), and in system building (a
modular system can easily be built by “putting its modules together”).
ABSTRACTION:
Abstraction is a very powerful concept that is used in all engineering disciplines. It is a tool
that permits a designer to consider a component at an abstract level without worrying about
the details of the implementation of the component. Any component or system provides some
services to its environment. An abstraction of a component describes the external behavior of
that component without bothering with the internal details that produce the behavior.
Presumably, the abstract definition of a component is much simpler than the component
itself.
There are two common abstraction mechanisms for software systems: Functional abstraction
and Data abstraction.
In functional abstraction, a module is specified by the function it performs. For example, a
module to compute the log of a value can be abstractly represented by the function log.
Similarly, a module to sort an input array can be represented by the specification of sorting.
Functional abstraction is the basis of partitioning in function-oriented approaches. That is,
when the problem is being partitioned, the overall transformation function for the system is
partitioned into smaller functions that comprise the system function. The decomposition of
the system is in terms of functional modules.
The second unit for abstraction is data abstraction. Data abstraction forms the basis for
object-oriented design. In using this abstraction, a system is viewed as a set of objects
providing some services. Hence, the decomposition of the system is done with respect to the
objects the system contains.
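The two mechanisms described above can be illustrated with a small Java sketch; the names (`log2`, `Counter`) are hypothetical, chosen only to mirror the log-function and object examples in the text.

```java
// Illustration of the two abstraction mechanisms discussed above.
public class Abstractions {
    // Functional abstraction: the module is known only by the function it
    // performs (computing a base-2 logarithm), not by how it performs it.
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Data abstraction: a Counter object provides services (increment, read)
    // while its internal representation stays hidden from clients.
    static class Counter {
        private int value;                 // internal detail, not exposed
        void increment() { value++; }
        int read() { return value; }
    }

    public static void main(String[] args) {
        System.out.println(Math.round(log2(8)));   // prints 3
        Counter c = new Counter();
        c.increment();
        c.increment();
        System.out.println(c.read());              // prints 2
    }
}
```

A caller of `log2` or `Counter` depends only on the abstract behavior, so the implementation can change without affecting clients, which is precisely the point of abstraction.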
Problem Partitioning and Hierarchy:
When solving a small problem, the entire problem can be tackled at once. For solving larger
problems, the basic principle is the time-tested principle of “divide and conquer.” Clearly,
dividing in such a manner that all the divisions have to be conquered together is not the intent
of this wisdom. This principle, if elaborated, would mean, “Divide into smaller pieces, so that
each piece can be conquered separately.”
Problem partitioning, which is essential for solving a complex problem, leads to hierarchies
in the design. That is, the design produced by using problem partitioning can be represented
as a hierarchy of components. The relationship between the elements in this hierarchy can
vary depending on the method used. For example, the most common is the “whole-part”
relationship, in which the system consists of some parts, each part consists of subparts, and so
on. This relationship can be naturally represented as a hierarchical structure between various
system parts. In general, a hierarchical structure makes it much easier to comprehend a
complex system. Due to this, all design methodologies aim to produce a design that has a
nice hierarchical structure.
STAGES IN A SYSTEM’S LIFE CYCLE
Requirement Determination
A system is intended to meet the needs of an organization, in this case to save storage
capacity. Thus the first step in the design is to specify these needs or requirements:
determining the requirements to be met by the system in the organization. Meetings of
prospective user departments are held and, through discussions, priorities among various
applications are determined, subject to the constraints of available computer memory,
bandwidth, time taken for transferring, and budget.
Requirement Specification
The top management of an organization first decides that a compression & decompression
system would be desirable to improve the operations of the organization. Once this basic
decision is taken, a system analyst is consulted. The first job of the system analyst is to
understand the existing system. During this stage he understands the various aspects of the
algorithms and data structures involved. Based on this he identifies what aspects of the
operations of the project need changes. The analyst discusses this with the users and
determines the areas where changes can be made effectively. The applications where file
transfer is allowed are checked. It is important to get the users involved from the initial
stages of the development of an application.
Feasibility Analysis
Having drawn up the rough specification, the next step is to check whether it is feasible to
implement the system. A feasibility study takes into account various constraints within which
the system should be implemented and operated. The resources needed for implementation
such as computing equipment, manpower and cost are estimated, based on the specifications
of user’s requirements. These estimates are compared with the available resources. A
comparison of the cost of the system and the benefits which will accrue is also made. This
document, known as the feasibility report, is given to the management of the organization.
Final Specifications
The developer of this s/w studies this feasibility report and suggests modifications in the
requirements, if any. Knowing the constraints on available resources, and the modified
requirements specified by the organization, the final specifications of the system to be
developed are drawn up by the system analyst. These specifications should be in a form
which can be easily understood by the users. The specifications state what the system would
achieve; they do not describe how the system would do it. These specifications are given
back to the users, who study them, consult their colleagues and offer suggestions to the
systems analyst for appropriate changes. These changes are incorporated by the system
analyst and a new set of specifications is given back to the users. After discussions between
the system analyst and the users, the final specifications are drawn up and approved for
implementation. Along with this, criteria for system approval are specified, which will
normally include a system test plan.
Hardware Study
Based on the finalized specifications it is necessary to determine the configuration of
hardware and support software essential to execute the specified application.
System Design
The next step is to develop the logical design of the system. The inputs to the system design
phase are functional specifications of the system and details about the computer
configuration. During this phase the logic of the programs is designed, and program test
plans and implementation plan are drawn up. The system design should begin from the
objectives of the system.
System Implementation
The next phase is implementation of the system. In this phase all the programs are written,
user operational documents are written, users are trained, and the system is tested with
operational data.
System Evaluation
After the system has been in operation for a reasonable period, it is evaluated and a plan for
its improvement is drawn up. This is called the system life cycle. The shortcomings of a
system, namely the gap between what a user expected from the system and what he actually
got, are realized only after a system is used for a reasonable time. Similarly, the shortcomings
in this system are realized only after it is implemented and used for some time.
System Modification
A computer-based system is a piece of software. It can be modified. Modifications will
definitely cost time and money, but users expect modifications to be made; as the name
‘software’ itself implies, it is soft and hence changeable.
Further, systems designed for use by clients cannot be static. These systems are intended for
real-world problems, and the environment in which an activity is conducted never remains
static. New requirements arise, and new, more efficient algorithms appear as research goes
on. Thus a system which cannot be modified to fulfill the changing requirements of an
organization is a bad one. A system should be designed for change. The strength of a good
computer-based system is that it is amenable to change. A good system designer is one who
can foresee what aspects of a system would change and would design the system in a flexible
way to easily accommodate changes.
SYSTEM PLANNING
To understand system development, we need to recognize that a candidate system has a life
cycle, just like a living system or a new product. System analysis and design are keyed to this
system planning. The analyst must progress from one stage to another methodically,
answering key questions and achieving results in each stage.
RECOGNITION OF NEED
One must know what the problem is before it can be solved. The basis for a candidate system
is recognition of a need for improving an information system or procedure. The need leads to
a preliminary survey or an initial investigation to determine whether an alternative system
can solve the problem. It entails looking into duplication of effort, bottlenecks, inefficient
existing procedures, or whether parts of the existing system would be candidates for
computerization.
FEASIBILITY STUDY:
Many feasibility studies are disillusioning for both users and analysts. First, the study often
presupposes that when the feasibility document is being prepared, the analyst is in a position
to evaluate solutions. Second, most studies tend to overlook the confusion inherent in system
development, the constraints and the assumed attitudes. If the feasibility study is to serve as a
decision document, it must answer three key questions:
Is there a new and better way to do the job that will benefit the user?
What are the costs and savings of the alternative(s)?
What is recommended?
The most successful system projects are not necessarily the biggest or most visible in a
business, but rather those that truly meet user expectations. More projects fail because of
inflated expectations than for any other reason.
Feasibility study is broadly divided into three parts:
Economic feasibility
Technical feasibility
Operational feasibility
1. ECONOMIC FEASIBILITY:
It is the most frequently used method for evaluating the effectiveness of a system. It
determines the benefits and savings that are expected from the system and compares them
with the costs. If benefits outweigh costs, then the decision is made to design and implement
the system. Otherwise, further justification or alteration in the proposed system will have to
be made if it is to have a chance of being approved. This is an ongoing effort that improves in
accuracy at each phase of the system life cycle.
So in our system we have considered these categories for the purpose of cost/benefit analysis
or economic feasibility.
1. Hardware Cost:
It relates to the actual purchase or lease of the computer and peripherals (for example,
printer, disk drive, tape unit). Determining the actual cost of the hardware is generally more
difficult when the system is shared by various users than for a dedicated stand-alone system.
In some cases, the best way to control this cost is to treat it as an operating cost.
In this system we are taking it as operating cost so as to minimize the cost of the initial
installation of the computer hardware.
2. Personnel Cost:
It includes EDP staff salaries and benefits (health insurance, vacation time, sick pay, etc.) as
well as pay for those involved in developing the system. Costs incurred during the
development of a system are one-time costs and are labeled development costs. Once the
system is installed, the costs of operating and maintaining the system become recurring costs.
Facility costs are expenses incurred in the preparation of the physical site where the
application or the computer will be in operation. This includes wiring, flooring, acoustics,
lighting and air conditioning. These costs are treated as one-time costs and are incorporated
into the overall cost estimate of the candidate system.
Our proposed system incurs only wiring cost, since nowadays all sites are well maintained as
regards flooring and lighting. Thus it would not incur any extra expense.
Operating cost includes all costs associated with the day-to-day operation of the system; the
amount depends on the number of shifts, the nature of the applications, and the caliber of the
operating staff. There are various ways of covering the operating costs. One approach is to
treat the operating cost as overhead. Another approach is to charge each authorized user for
the amount of processing they request from the system. The amount charged is based on
computer time, staff time, and the volume of the output produced. In any case, some
accounting is necessary to determine how operating costs should be handled.
As our candidate system is not very big, we require only one server and a few terminals for
maintaining and processing data. Their costs can easily be determined at the installation time
of the proposed system. As a computer is also a machine, it also depreciates; by using any of
the depreciation methods we can determine its annual costs after deducting the depreciation
cost.
Supply costs are variable costs that increase with the use of paper, ribbons, disks, and the
like. They should be estimated and included in the overall cost of the system.
A system is also expected to provide benefits. The first task is to identify each benefit and
then assign a monetary value to it for cost/benefit analysis. Benefits may be tangible and
intangible, direct and indirect.
The two major benefits are improved performance and minimized cost of processing. The
performance category emphasizes improvements in the accuracy of, or access to, information
and easier access to the system by authorized users. Minimizing costs through an efficient
system (error control or reduction of staff) is a benefit that should be measured and included
in the cost/benefit analysis.
This cost in our proposed system depends on the number of customers, so sometimes it is
more and sometimes less. It is not very easy to estimate this cost; what we can do is make a
rough estimate, and when the system is installed at a client site we can compare this rough
estimate with the actual expenses incurred due to supply cost.
2. TECHNICAL FEASIBILITY:
Technical feasibility centers on the existing computer system (hardware, software, etc.) and
to what extent it can support the proposed addition. For example, if the current computer is
operating at 80 percent capacity (an arbitrary ceiling), then running another application could
overload the system or require additional hardware. This involves financial considerations to
accommodate technical enhancements. If the budget is a serious constraint, then the project is
judged not feasible.
Presently, at our client's side all the work is done manually, so the question of overloading
the system or requiring additional hardware does not arise; thus our candidate system is
technically feasible.
3. OPERATIONAL FEASIBILITY:
People are inherently resistant to change, and computers have been known to facilitate
change. An estimate should be made of how strong a reaction the user staff is likely to have
towards the development of a computerized system. It is common knowledge that computer
installations have something to do with turnover, transfers, retraining, and changes in
employee job status. Therefore, it is understandable that the introduction of a candidate
system requires special efforts to educate, sell, and train the staff on new ways of conducting
business.
There is no doubt that people are inherently resistant to change, and computers have been
known to facilitate change. As in today's world all the work is computerized, people only
benefit from computerization. As far as our system is concerned, it is only going to benefit its
users in their daily routine work. There is no danger of someone losing a job or not getting
proper attention after the installation of our proposed system. Thus our system is
operationally feasible as well.
REQUIREMENT ANALYSIS
Analysis is a detailed study of the various operations performed by a system and their
relationships within and outside the system. One aspect of analysis is defining the boundaries
of the system and determining whether or not a candidate system should consider other
related systems. During analysis, data are collected on the available files, decision points,
and transactions handled by the present system.
Dataflow diagrams, interviews, on-site observations, and questionnaires are
examples. The interview is commonly used tool in analysis. It requires special skills and
sensitivity to the subjects being interviewed. Bias in data collection and interpretation can be
a problem, training, experience and commonsense are required for collection of the
information needed to do the analysis.
Once analysis is completed, the next step is to decide how the problem might be
solved. Thus, in system design, we move from the logical to the physical aspects of
system planning.
HARDWARE & SOFTWARE REQUIREMENTS
HARDWARE SPECIFICATIONS:
Processor: Pentium I/II/III or higher
RAM: 128 MB or higher
Monitor: 15-inch (digital) with 800 x 600 support
Keyboard: 101-key keyboard
Mouse: 2-button serial/PS-2
Tools / Platform / Language Used:
Language: Java
OS: Any OS such as Windows XP/98/NT/Vista
PROJECT DESCRIPTION
What is the Huffman Algorithm?
Huffman coding is an algorithm presented by David Huffman in 1952. It is an algorithm which
works with integer-length codes. In fact, if we want an algorithm that uses integer-length
codes, Huffman is the best option, because it is optimal.
We use Huffman, for example, for compressing the bytes output by LZP. First we have to
know their probabilities; we use a QSM model for that matter. Based on the
probabilities the algorithm makes the codes, which can then be output. Decoding is more or less the
reverse process: based on the probabilities and the coded data, it outputs the decoded byte.
To organize the probabilities, the algorithm uses a binary tree, in which it stores the symbols and
their probabilities. The position of a symbol depends on its probability, and each symbol is then
assigned a code based on its position in the tree. The codes have the prefix property and are
instantaneously decodable, so they are well suited for compression and decompression.
The Huffman compression algorithm assumes data files consist of some byte values that
occur more frequently than other byte values in the same file. This is very true for text files
and most raw gif images, as well as EXE and COM file code segments.
By analyzing the file, the algorithm builds a "frequency table" recording how often each byte value
occurs. With the frequency table the algorithm can then build the "Huffman tree".
The purpose of the tree is to associate each byte value with a bit string of variable
length. The more frequently used characters get shorter bit strings, while the less frequent
characters get longer bit strings. Thus the data file may be compressed.
To compress the file, the Huffman algorithm reads the file a second time, converting each
byte value into the bit string assigned to it by the Huffman tree and writing the bit string
to a new file. The decompression routine reverses the process by reading in the stored
frequency table (presumably stored in the compressed file as a header) that was used in
compressing the file. With the frequency table the decompressor can then rebuild the
Huffman tree and, from that, expand all the bit strings stored in the compressed file back to
their original byte values.
Huffman Encoding :
Huffman encoding works by substituting more efficient codes for the data; the codes are then
stored as a conversion table and passed to the decoder before the decoding process takes
place. This approach was first introduced by David Huffman in 1952 for text files and has
spawned many variations. Even the CCITT (International Telegraph and Telephone Consultative
Committee) one-dimensional encoding used for bilevel (black and white) image data
telecommunications is based on Huffman encoding.
Algorithm :
Basically, in Huffman encoding each unique value is assigned a binary code, with codes
varying in length. Shorter codes are used for more frequently occurring values. These codes
are then stored in a conversion table and passed to the decoder before any decoding is done.
So how does the encoder start assigning codes to the values?
Let's imagine that there is this data stream that is going to be encoded by Huffman Encoding :
AAAABCDEEEFFGGGH
The frequency of each unique value that appears is as follows:
A : 4, B : 1, C : 1, D : 1, E : 3, F : 2, G : 3, H :1
Based on the frequency count the encoder can generate a statistical model reflecting the
probability that each value will appear in the data stream :
A : 0.25, B : 0.0625, C : 0.0625, D : 0.0625, E : 0.1875, F : 0.125, G : 0.1875, H : 0.0625
From the statistical model the encoder can build a minimum code for each value and store it in the
conversion table. The algorithm pairs up the two values with the least probability; in this case we
take B and C and combine their probabilities, so that the pair is treated as one unique value. Along
the way, each value B, C, and even the combined BC is assigned a 0 or 1 on its branch. This means that
0 and 1 will be the least significant bits of the codes for B and C respectively. From there the
algorithm compares the remaining values for another two values with the smallest probability
and repeats the whole process until the values build up into the structure of an upside-down
tree. The whole process is illustrated on the next page.
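The pairing-and-merging procedure described above can be sketched in Java. This is a minimal illustration only, not the project's own Encoder module; the names (HuffmanSketch, build, totalBits) are invented for the sketch, and ties in the priority queue may be broken differently than in the diagrams, although every optimal tree yields the same total encoded length.

```java
import java.util.*;

public class HuffmanSketch {
    static class Node {
        final char symbol;      // meaningful only for leaves
        final int freq;
        final Node left, right;
        Node(char s, int f, Node l, Node r) { symbol = s; freq = f; left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Count how often each character occurs in the stream.
    static Map<Character, Integer> count(String s) {
        Map<Character, Integer> m = new HashMap<>();
        for (char c : s.toCharArray()) m.merge(c, 1, Integer::sum);
        return m;
    }

    // Repeatedly merge the two lowest-frequency nodes until one root remains.
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));   // merged pair
        }
        return pq.poll();
    }

    // Total encoded length in bits = sum over leaves of frequency * depth.
    static int totalBits(Node n, int depth) {
        if (n.isLeaf()) return n.freq * depth;
        return totalBits(n.left, depth + 1) + totalBits(n.right, depth + 1);
    }
}
```

For the stream AAAABCDEEEFFGGGH, any optimal tree encodes the 16 characters in 45 bits in total, roughly 2.8 bits per character instead of 8.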
[Figures: step-by-step construction of the Huffman tree]
The binary code for each unique value can then be read by following down from the top
of the upside-down tree (most significant bit) until we reach the unique value we want
(least significant bit). Suppose, for example, we want to find the code for B: follow the path
shown by the blue arrow in the diagram above and arrive at B. Notice that beside each of
the paths we take there is a bit value; combining each of the values we came across,
we get the code for B: 1000. The same approach is then used for all of the other
unique values, and their codes are stored in the conversion table.
Code Construction :
To assign codes you need only a single pass over the symbols, but before doing that you need
to calculate where the codes for each codelength start. To do so consider the following: The
longest code is all zeros and each code differs from the previous by 1 (I store them such that
the last bit of the code is in the least significant bit of a byte/word).
In the example this means:
Codes with length 4 start at 0000
Codes with length 3 start at (0000+4*1)>>1 = 010. There are 4 codes with length
4 (that is where the 4 comes from), so the next length-4 code would start at 0100; but
since it is to be a length-3 code, we remove the last 0 (if we ever remove a 1, there is a
bug in the codelengths).
Codes with length 2 start at (010+2*1)>>1 = 10.
Codes with length 1 start at (10+2*1)>>1 = 10.
Codes with length 0 start at (10+0*1)>>1 = 1. If anything other than 1 is the start for
codelength 0, there is a bug in the codelengths!
Then visit each symbol in alphabetical sequence (to ensure the second condition) and assign
the start value for the codelength of that symbol as the code of that symbol. After that, increment
the start value for that codelength by 1.
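The start-value computation above can be sketched as follows. This is an illustration only, assuming the code lengths implied by the worked example (four codes of length 4, two of length 3, two of length 2); the class name CanonicalCodes and method assign are invented for the sketch and are not part of the project source.

```java
import java.util.*;

public class CanonicalCodes {
    // lengths: symbol -> code length, iterated in alphabetical order as the
    // text requires. Returns symbol -> code as a '0'/'1' string.
    static Map<Character, String> assign(SortedMap<Character, Integer> lengths) {
        int maxLen = Collections.max(lengths.values());
        int[] count = new int[maxLen + 1];
        for (int len : lengths.values()) count[len]++;

        // The start value for the longest length is all zeros; each shorter
        // length starts at (start + count) >> 1, dropping the last bit.
        int[] start = new int[maxLen + 1];
        start[maxLen] = 0;
        for (int len = maxLen; len > 0; len--)
            start[len - 1] = (start[len] + count[len]) >> 1;

        Map<Character, String> codes = new LinkedHashMap<>();
        for (Map.Entry<Character, Integer> e : lengths.entrySet()) {
            int len = e.getValue();
            // Take the current start value for this length, then bump it by 1.
            String bits = Integer.toBinaryString(start[len]++);
            while (bits.length() < len) bits = "0" + bits;  // left-pad to len bits
            codes.put(e.getKey(), bits);
        }
        return codes;
    }
}
```

With the example lengths this assigns B, C, D, H the codes 0000-0011, E and F the codes 010 and 011, and A and G the codes 10 and 11, which is a valid prefix-free set.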
Maximum Length of a Huffman Code :
Apart from the ceil(log2(alphabetsize)) bound on the number of nonzero bits in this particular
canonical Huffman code, it is useful to know the maximum length a Huffman code can reach.
In fact there are two limits, which must both be fulfilled.
No Huffman code can be longer than alphabetsize-1. Proof: it is impossible to construct a
binary tree with N nodes and more than N-1 levels.
The maximum length of the code also depends on the number of samples you use to derive
your statistics from; the sequence is as follows (the samples include the fake samples that give
each symbol a nonzero probability!):
The Compression or Huffing Program:
To compress a file (a sequence of characters) you need a table of bit encodings, e.g., an ASCII
table, or a table giving the sequence of bits used to encode each character. This table is
constructed from a coding tree, using root-to-leaf paths to generate the bit sequence that
encodes each character.
Assuming you can write a specific number of bits at a time to a file, a compressed file is
made using the following top-level steps. These steps will be developed further into sub-steps,
and you'll eventually implement a program based on these ideas and sub-steps.
Build a table of per-character encodings. The table may be given to you, e.g., an ASCII table,
or you may build the table from a Huffman coding tree.
Read the file to be compressed (the plain file) and process one character at a time. To process
each character, find the bit sequence that encodes it using the table built in the
previous step, and write this bit sequence to the compressed file.
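Writing "a specific number of bits at a time" assumes a way of packing individual bits into whole bytes. A minimal sketch of such a writer follows; the class name BitWriter and the policy of zero-padding the final partial byte are assumptions for illustration, not the project's own code.

```java
import java.io.*;

public class BitWriter implements Closeable {
    private final OutputStream out;
    private int buffer = 0;   // bits accumulated so far, most significant first
    private int nBits = 0;    // how many bits are currently in the buffer

    public BitWriter(OutputStream out) { this.out = out; }

    // Write one bit (0 or 1); flush a byte whenever eight bits accumulate.
    public void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++nBits == 8) { out.write(buffer); buffer = 0; nBits = 0; }
    }

    // Write a whole code, given as a '0'/'1' string from the encoding table.
    public void writeCode(String code) throws IOException {
        for (char c : code.toCharArray()) writeBit(c - '0');
    }

    // Pad the final partial byte with zeros and close the stream.
    @Override
    public void close() throws IOException {
        while (nBits != 0) writeBit(0);
        out.close();
    }
}
```

For example, writing the codes 1000 and 0001 back to back produces the single byte 10000001.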
Building the Table for Compression:
To build a table of optimal per-character bit sequences you'll need to build a Huffman coding
tree using the greedy Huffman algorithm. The table is generated by following every root-to-leaf
path and recording the left/right 0/1 edges followed. These paths give the optimal
encoding bit sequences for each character.
There are three steps in creating the table:
1 Count the number of times every character occurs. Use these counts to create an initial
forest of one-node trees. Each node has a character and a weight equal to the number of times
the character occurs.
2 Use the greedy Huffman algorithm to build a single tree. The final tree will be used in the
next step.
3 Follow every root-to-leaf path creating a table of bit sequence encodings for every
character/leaf.
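Step 3 above, following every root-to-leaf path, can be sketched recursively. The Node shape here mirrors the earlier tree-building sketch and is illustrative only; it is not the project's own HuffmanNode class.

```java
import java.util.*;

public class CodeTable {
    static class Node {
        final char symbol;
        final Node left, right;
        Node(char s, Node l, Node r) { symbol = s; left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Walk every root-to-leaf path, recording 0 for each left edge taken
    // and 1 for each right edge; the accumulated path is the leaf's code.
    static Map<Character, String> build(Node root) {
        Map<Character, String> table = new HashMap<>();
        walk(root, "", table);
        return table;
    }

    private static void walk(Node n, String path, Map<Character, String> table) {
        if (n.isLeaf()) { table.put(n.symbol, path); return; }
        walk(n.left, path + "0", table);   // left edge contributes a 0
        walk(n.right, path + "1", table);  // right edge contributes a 1
    }
}
```

A tree with leaf A on the left of the root and leaves B and C under the right child yields the table A=0, B=10, C=11.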
Header Information:
You must store some initial information in the compressed file that will be used by the
uncompression/unhuffing program. Basically, you must store the tree used to compress the
original file; this tree is used by the uncompression program.
There are several alternatives for storing the tree. Some are outlined here; you may explore
others as part of the specifications of your assignment.
Store the character counts at the beginning of the file. You can store counts for every
character, or counts for the non-zero characters. If you do the latter, you must include
some method for indicating the character, e.g., store character/count pairs.
You could use a "standard" character frequency, e.g., for any English language text
you could assume weights/frequencies for every character and use these in
constructing the tree for both compression and uncompression.
You can store the tree at the beginning of the file. One method for doing this is to do a
pre-order traversal, writing out each node visited. You must differentiate leaf nodes from
internal (non-leaf) nodes. One way to do this is to write a single bit for each node, say 1
for a leaf and 0 for a non-leaf. For leaf nodes, you also need to write the character
stored. For non-leaf nodes there is no further information to write, just the bit
that indicates an internal node.
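The last alternative, a pre-order header with 1-bit node markers, can be sketched as follows. For clarity the marker bits and the leaf characters go into one String here; a real header would pack them with a bit-level writer. The class name TreeHeader and the Node shape are invented for the sketch.

```java
public class TreeHeader {
    static class Node {
        final char symbol;
        final Node left, right;
        Node(char s, Node l, Node r) { symbol = s; left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Pre-order serialization: '1' followed by the character for a leaf,
    // '0' alone for an internal node.
    static void write(Node n, StringBuilder out) {
        if (n.isLeaf()) {
            out.append('1').append(n.symbol);
        } else {
            out.append('0');
            write(n.left, out);
            write(n.right, out);
        }
    }

    // Rebuild the tree by consuming the same stream left to right;
    // pos[0] is a cursor into the string.
    static Node read(String s, int[] pos) {
        if (s.charAt(pos[0]++) == '1')
            return new Node(s.charAt(pos[0]++), null, null);
        Node left = read(s, pos);
        Node right = read(s, pos);
        return new Node('\0', left, right);
    }
}
```

A tree with leaf A on the left of the root and leaves B, C under the right child serializes to 01A01B1C, and reading that string back reconstructs the same shape.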
Decompressing:
Decompression involves rebuilding the Huffman tree from the stored frequency table (again,
presumably in the header of the compressed file) and converting the bit stream back into
characters. You read the file a bit at a time. Beginning at the root node of the Huffman tree,
and depending on the value of the bit, you take the right or left branch of the tree and then
return to read another bit. When the node you reach is a leaf (it has no right and left child
nodes), you write its character value to the decompressed file and go back to the root node for
the next bit.
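The bit-by-bit tree walk just described can be sketched as below, assuming the same Node shape as the earlier sketches (left branch for 0, right for 1). The bits arrive here as a '0'/'1' string for clarity; a real decoder would read them from the compressed file. The class name Decode is invented for the sketch.

```java
public class Decode {
    static class Node {
        final char symbol;
        final Node left, right;
        Node(char s, Node l, Node r) { symbol = s; left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char b : bits.toCharArray()) {
            cur = (b == '0') ? cur.left : cur.right;   // take the branch
            if (cur.isLeaf()) {                        // emit and restart at root
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }
}
```

With the codes A=0, B=10, C=11, the bit string 01011 decodes to ABC.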
Transmission and storage of Huffman-encoded Data:
If your system is continually dealing with data in which the symbols have similar frequencies
of occurrence, then both encoders and decoders can use a standard encoding table/decoding
tree. However, even text data from various sources will have quite different characteristics.
For example, ordinary English text will generally have 'e' near the root of the tree, with
short encodings for 'a' and 't', whereas C programs would generally have ';' near the root, with
short encodings for other punctuation marks such as '(' and ')' (depending on the number and
length of comments!). If the data has variable frequencies, then, for optimal encoding, we
have to generate an encoding tree for each data set and store or transmit the encoding with the
data. The extra cost of transmitting the encoding tree means that we will not gain an overall
benefit unless the data stream to be encoded is quite long, so that the savings through
compression more than compensate for the cost of transmitting the encoding tree.
WORKING OF PROJECT:
MODULE & THEIR DESCRIPTION :-
The project contains the following modules:
Huffman Zip
Encoder
Decoder
Table
DLNode
Priority Queue
Huffman Node
HuffmanZip is the main module, which uses an applet; it provides the user interface. Encoder is
the module for compressing the file. It implements the Huffman algorithm for compressing
text and image files. It first calculates the frequencies of all the occurring symbols. Then, on the
basis of these frequencies, it generates the priority queue. This priority queue is used for
finding the symbols with the least frequencies. The two symbols with the lowest frequencies are
deleted from the queue and a new symbol is added to the queue with frequency equal to the
sum of these two. Meanwhile we generate a tree in which the leaf nodes are the two
deleted nodes and the root node is the new node added to the queue. At last we traverse the
tree from the root node to each leaf node, assigning 0 to the left child and 1 to the right
child. In this way we assign a code to every symbol in the file. These are binary codes; we then
group them, calculate the equivalent integers, and store them in the output
file, which is the compressed file.
Decoder works in the reverse order of the encoder. It reads the input from the compressed file
and converts it into the equivalent binary code. It has one other input, the binary tree generated
in the encoding process, and on the basis of these data it regenerates the original file. This
project is based on lossless compression.
Table is used for storing the code of each symbol. PriorityQueue takes as input the symbols
and their related frequencies, and on the basis of these frequencies it assigns a priority to each
symbol. HuffmanNode is used for creating the binary tree. It takes as input two symbols from the
priority queue and creates two nodes by comparing the frequencies of these two symbols. It
places the symbol with the lower frequency on the left and the symbol with the higher frequency on
the right; it then deletes these two symbols from the priority queue and inserts a new symbol with
frequency equal to the sum of the frequencies of the two deleted symbols. It also generates a
parent node for the two nodes and assigns it a frequency equal to the sum of the frequencies of the
two leaf nodes.
DATA FLOW DIAGRAM
When solving a small problem, the entire problem can be tackled at once. For solving larger
problems, the basic principle is the time-tested principle of "divide and conquer". Clearly,
dividing in such a manner that all the divisions have to be conquered together is not the intent
of this wisdom. This principle, if elaborated, would mean "divide into smaller pieces, so that
each piece can be conquered separately".
Problem partitioning, which is essential for solving a complex problem, leads to hierarchies
in the design. That is, the design produced by using problem partitioning can be represented
as a hierarchy of components. The relationship between the elements in this hierarchy can
vary depending on the method used. For example, the most common is the "whole-part of"
relationship. In this the system consists of some parts, each part consists of subparts, and so
on. This relationship can be naturally represented as a hierarchical structure between various
system parts. In general, a hierarchical structure makes it much easier to comprehend a complex
system. Due to this, all design methodologies aim to produce a design that has a nice
hierarchical structure.
The DFD was first designed by Larry Constantine as a way of expressing system
requirements in a graphical form; this led to a modular design.
A DFD, also known as a "bubble chart," has the purpose of clarifying system requirements and
identifying major transformations that will become programs in system design. It is thus the
starting point of the design phase, which functionally decomposes the requirement specifications
down to the lowest level of detail. A DFD consists of a series of bubbles joined by lines that
represent data flows in the system.
DFD SYMBOLS
In the DFD, there are four symbols:
1 A square defines a source (originator) or destination of system data.
2 An arrow identifies data flow (data in motion). It is a pipeline through which information
flows.
3 A circle or "bubble" (some people use an oval bubble) represents a process that
transforms incoming data flow(s) into outgoing data flow(s).
4 An open rectangle is a data store (data at rest), or a temporary repository of data.
SYMBOLS MEANING
Source or destination of data
Data flow
Process that transforms data flow
Data store
CONSTRUCTING DFD
Several rules of thumb are used in drawing DFDs:
1 Processes should be named and numbered for easy reference. Each name should be
representative of the process.
2 The direction of flow is from top to bottom and from left to right. Data traditionally flow
from the source (upper left corner) to the destination (lower right corner), although they may
flow back to a source. One way to indicate this is to draw a long flow line back to the source.
An alternative way is to repeat the source symbol as a destination. Since it is used more than
once in the DFD, it is marked with a short diagonal in the lower right corner.
3 When a process is exploded into lower-level details, they are numbered.
4 The names of data sources and destinations are written in capital letters. Process and data
flows names have the first letter of each word capitalized.
HOW DETAILED SHOULD A DFD BE?
The DFD is designed to aid communication. If it contains dozens of processes and data stores,
it gets too unwieldy. The rule of thumb is to explode the DFD to a functional level, so that the
next sublevel does not exceed 10 processes. Beyond that, it is best to take each function
separately and expand it to show the explosion of the single process. If a user wants to know
what happens within a given process, then the detailed explosion of that process may be
shown.
A DFD typically shows the minimum contents of data elements that flow in and out.
A leveled set has a starting DFD, which is a very abstract representation of the system,
identifying the major inputs and outputs and the major processes in the system. Each
process is then refined and a DFD is drawn for it. In other words, a bubble in a DFD is
expanded into a DFD during refinement. For the hierarchy to be consistent, it is important
that the net inputs and outputs of the DFD for a process are the same as the inputs and outputs
of that process in the higher-level DFD. This refinement stops when each bubble can be easily
identified or understood. It should be pointed out that during refinement, though the net inputs
and outputs are preserved, a refinement of the data might also occur. That is, a unit of data may
be broken into its components for processing when the detailed DFD for a process is being drawn.
So, as the processes are decomposed, data decomposition also occurs.
The DFD methodology is quite effective, especially when the required design is unclear and the
analyst needs a notational language for communication. The DFD is easy to understand after a
brief orientation.
The main problem, however, is the large number of iterations that are often required to arrive
at the most accurate and complete solution.
DATA FLOW DIAGRAM
The DFD helps in understanding the functioning of, and the modules used in, the coding. It easily
describes the flow and storage of the data: what variables are given as input, the flow of data in
the program, and the final output. Here we present some DFDs which help in understanding the
program.
[DFD figures: code generator, updation of priority queue, traverse, code store]
Print Layouts
[Screenshots of the application]
IMPLEMENTATION:
The implementation phase is less creative than system design. It is primarily concerned with
user training, site preparation, and file conversion. When the candidate system is linked by
terminals to remote sites, the telecommunication network and testing of the network along with
the system are also included under implementation.
During the implementation phase, the system actually takes physical shape.
As in the other two stages, the analyst, his or her associates, and the user perform many tasks,
including:
Writing, testing, debugging and documenting systems.
Converting data from the old to the new system.
Training the system’s users.
Completing system documentation.
Evaluating the final system to make sure that it is fulfilling the original need and that it
began operation on time and within budget.
The analyst's involvement in each of these activities varies from organization to organization.
In small organizations, specialists may work on different phases and tasks, such as
training, ordering equipment, converting data from old methods to the new, or certifying the
correctness of the system.
The implementation phase ends with an evaluation of the system after placing it into operation
for a period of time. By then, most program errors will have shown up and most costs will
have become clear. The system audit is a last check or review of a system
to ensure that it meets design criteria. Evaluation forms the feedback part of the cycle that
keeps implementation going as long as the system continues operation.
Ordering and installing any new hardware required by the system.
Developing operating procedures for the computer center staff.
Establishing a maintenance procedure to repair and enhance the system.
During the final testing, user acceptance is tested, followed by user training. Depending on the
nature of the system, extensive user training may be required. Conversion usually takes place
at about the same time the user is being trained, or later.
In the extreme, the programmer is falsely viewed as someone who ought to be isolated from
other aspects of system development. Programming is itself design work, however. The
initial parameters of the candidate system should be modified as a result of programming
efforts. Programming provides a "reality test" for the assumptions made by the analyst; it is
therefore a mistake to exclude programmers from the initial system design.
System testing checks the readiness and accuracy of the system to access, update, and retrieve
data from new files. Once the program becomes available, test data are read into the computer
and processed against the files provided for testing. In most conversions a parallel run is
conducted, where the new system runs simultaneously with the old system. This method, though
costly, provides added assurance against errors in the candidate system.
TEST PLAN
A test plan is a service delivery agreement. It is quality assurance's way of communicating
to the developer, the client, and the rest of the team what can be expected.
The key points of a test plan are:
Introduction: Summarizes key features and expectations of the software, along with the testing approach.
Scope: Includes a description of the test types.
Risks and assumptions: Defines the risks to the testing phase, such as criteria that could suspend
testing.
Testing schedules and cycles: States when testing will be completed and the number of expected cycles.
Test resources: Specifies testers and bug fixers.
Some special terms in testing fundamentals:
Error:
The term "error" is used in two different ways. It refers to the difference between the
actual output of the software and the correct output; in this interpretation, error is an essential
measure of the difference between actual and ideal output. "Error" is also used to refer to a
human action that results in software containing a defect or fault.
Fault:
Fault is a condition that causes a system to fail in performing its required function. A
fault is a basic reason for software malfunction and is synonymous with the commonly used
term 'Bug'.
Failure: Failure is the inability of a system or component to perform a required function
according to its specifications. A software failure occurs if the behavior of the software is
different from the specified behavior. Failure may be caused by functional or
performance reasons.
Some of the commonly used strategies for testing are as follows:
Unit Testing
Module testing
Integration testing
System testing
Acceptance testing
Unit Testing:
The term "unit testing" comprises the set of tests performed by an
individual programmer prior to the integration of the unit into a larger system.
A program unit is usually small enough that the programmer who developed it can
test it in great detail, and certainly in greater detail than will be possible when the unit
is integrated into an evolving software product. In unit testing, the programs are tested
separately, independently of each other. Since the check is done at the program level, it is
also called program testing.
Module Testing:
A module encapsulates related components, so it can be tested without other system modules.
Subsystem Testing:
Subsystems may be independently designed and implemented. Common
problems, such as sub-system interface mistakes, can be checked and concentrated on
in this phase.
[Figure: Coding & Debugging, Unit Testing, Integration Testing]
There are four categories of tests that a programmer will typically perform on a program
unit:
Functional Tests
Performance Test
Stress Test
Structure Test
Functional Test:
Functional test cases involve exercising the code with nominal input values for which
expected results are known, as well as boundary values (minimum values, maximum
values, and values on and just outside the functional boundaries) and special values.
Performance Test:
Performance testing determines the amount of execution time spent in various
parts of the unit, program throughput, response time, and device utilization by the
program unit. A certain amount of performance tuning may be done during testing;
however, caution must be exercised to avoid expending too much effort on fine-tuning
a program unit that contributes little to the overall performance of the entire system.
Performance testing is most productive at the subsystem and system levels.
Stress Test :
Stress tests are those tests designed to intentionally break the unit. A great deal can be
learned about the strengths and limitations of a program by examining the manner in which
a program unit breaks.
Structure Test:
Structure tests are concerned with exercising the internal logic of a program and traversing
particular execution paths. Some authors refer collectively to functional, performance, and
stress testing as "black box" testing, while structure testing is referred to as "white box" or
"glass box" testing. The major activities in structural testing are deciding which paths to
exercise, deriving test data to exercise those paths, determining the test coverage criterion to
be used, and executing the test cases on some modules and subsystems. This mix
alleviates many of the problems encountered in pure top-down testing and retains the
advantages of top-down integration at the subsystem and system levels.
Automated tools used in integration testing include module drivers, test data generators,
environment simulators, and a management facility to allow easy configuration and
reconfiguration of system elements. Automated module drivers permit specification of
test cases (both input and expected results) in a descriptive language. The driver tool
then calls the routine using the specified test cases, compares actual with expected results,
and reports discrepancies.
Some module drivers also provide program stubs for top-down testing. Test cases
are written for the stub, and when the stub is invoked by the routine being tested, the
driver examines the input parameters to the stub and returns the corresponding outputs to
the routine. Automated test drivers include AUT, MTS, TEST MASTER, and TPL.
Test data generators are of two varieties: those that generate files of random
data values according to some predefined format, and those that generate test data for
particular execution paths. In the latter category, symbolic executors such as ATTEST can
sometimes be used to derive a set of test data that will force program execution to follow a
particular control path.
Environment simulators are sometimes used during integration and
acceptance testing to simulate the operating environment in which the software will
function. Simulators are used in situations in which operation of the actual environment
is impractical. Examples of simulators are PRIM (GAL75), for emulating machines
that do not exist, and the Saturn Flight Program Simulator, for simulating live flight test
cases and measuring the coverage achieved when the test cases are exercised.
System Testing
System testing involves two kinds of activities:
Integration testing
Acceptance testing
Strategies for integrating software components into a functioning product include the
bottom-up strategy, the top-down strategy, and the sandwich strategy. Careful planning and
scheduling are required to ensure that modules will be available for integration into
the evolving software product when needed. The integration strategy dictates the order in
which modules must be available, and thus exerts a strong influence on the order in
which modules are written, debugged, and unit tested.
Acceptance testing involves planning & execution of functional tests, performance
tests, and stress tests to verify that the implemented system satisfies its requirements.
Acceptance tests are typically performed by quality assurance and/or customer
organizations.
CONCLUSIONS
Data compression is a topic of much importance and many applications. Methods of data
compression have been studied for almost four decades. This report has provided an overview
of data compression methods of general utility. The algorithms have been evaluated in terms
of the amount of compression they provide, algorithm efficiency, and susceptibility to error.
While algorithm efficiency and susceptibility to error are relatively independent of the
characteristics of the source ensemble, the amount of compression achieved depends to a great
extent upon the characteristics of the source.
Semantic dependent data compression techniques are special-purpose methods designed to
exploit local redundancy or context information. A semantic dependent scheme can usually
be viewed as a special case of one or more general-purpose algorithms. It should also be
noted that algorithm HUFFMAN CODING & DECODING is a general-purpose technique
which exploits locality of reference, a type of local redundancy.
Susceptibility to error is the main drawback of each of the algorithms presented here.
Although channel errors are more devastating to adaptive algorithms than to static ones, it is
possible for an error to propagate without limit even in the static case. Methods of limiting
the effect of an error on the effectiveness of a data compression algorithm should be
investigated.
FUTURE ENHANCEMENT & NEW DIRECTIONS
NEW DIRECTIONS :
Data compression is still very much an active research area. This section suggests
possibilities for further study.
The discussion above illustrates the susceptibility to error of the codes presented in this survey.
Strategies for increasing the reliability of these codes while incurring only a moderate loss of
efficiency would be of great value. This area appears to be largely unexplored. Possible
approaches include embedding the entire ensemble in an error-correcting code or reserving
one or more codewords to act as error flags. For Huffman encoding & decoding it may be
necessary for receiver and sender to verify the current code mapping.
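One way the sender and receiver could verify that they hold the same code mapping is to exchange a digest of the frequency table from which the Huffman tree is built. The sketch below assumes this approach; the `fingerprint` helper and class name are illustrative and not part of the project's code:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class CodeMapCheck {
    // Hypothetical helper: SHA-256 digest of the 256-entry frequency table.
    // Since both sides build the tree deterministically from this table,
    // matching digests imply matching code mappings.
    static byte[] fingerprint(int[] freq) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(freq.length * 4);
            for (int f : freq) buf.putInt(f);
            return MessageDigest.getInstance("SHA-256").digest(buf.array());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] senderFreq = new int[256];
        senderFreq['a'] = 5;
        senderFreq['b'] = 2;
        int[] receiverFreq = senderFreq.clone();
        boolean same = java.util.Arrays.equals(
                fingerprint(senderFreq), fingerprint(receiverFreq));
        System.out.println(same); // prints true when the mappings agree
    }
}
```

Exchanging a short digest instead of the whole table keeps the verification overhead small relative to the compressed stream.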
Another important research topic is the development of theoretical models for data
compression which address the problem of local redundancy. Models based on Huffman
coding may be exploited to take advantage of interaction between groups of symbols.
Entropy tends to be overestimated when symbol interaction is not considered. Models which
exploit relationships between source messages may achieve better compression than
predicted by an entropy calculation based only upon symbol probabilities.
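The entropy calculation referred to above, computed from symbol probabilities alone, can be sketched as follows (a minimal example assuming independent symbols; the class and method names are illustrative):

```java
public class EntropyDemo {
    // Shannon entropy H = -sum p_i * log2(p_i), in bits per symbol.
    // This is the compression bound when symbols are treated independently;
    // models exploiting symbol interaction can beat it.
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) {
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Source with probabilities {0.5, 0.25, 0.25}:
        System.out.println(entropy(new double[]{0.5, 0.25, 0.25})); // ~1.5 bits/symbol
    }
}
```

For this source a Huffman code (codeword lengths 1, 2, 2) achieves the 1.5 bits/symbol bound exactly, since every probability is a negative power of two.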
SCOPE FOR FUTURE WORK:
Since this system has been developed using object-oriented programming, its code offers
every chance of reuse in other environments, even on different platforms. Its present features
can also be enhanced through simple modifications to the code, so that it can be reused in
changing scenarios.
SCOPE OF FURTHER APPLICATION:
This application is easy to implement, and its components can be reused as and when
required. It can be updated in a future version, new features can be added as needed, and
there is flexibility in all the modules.
SOURCE CODE
HuffmanZip.java
import javax.swing.*;
import java.io.*;
import java.awt.*;
import java.awt.event.*;

public class HuffmanZip extends JFrame {
    private JProgressBar bar;
    private JButton enc, dec, center;
    private JLabel title;
    private JFileChooser choose;
    private File input1, input2;
    private Encoder encoder;
    private Decoder decoder;
    private ImageIcon icon;

    public HuffmanZip() {
        super("Zip utility V1.1");
        Container c = getContentPane();
        enc = new JButton("Encode");
        dec = new JButton("Decode");
        center = new JButton();
        title = new JLabel("   Zip Utility V1.1   ");
        choose = new JFileChooser();
        icon = new ImageIcon("huff.jpg");
        center.setIcon(icon);

        enc.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                int f = choose.showOpenDialog(HuffmanZip.this);
                if (f == JFileChooser.APPROVE_OPTION) {
                    input1 = choose.getSelectedFile();
                    encoder = new Encoder(input1);
                    HuffmanZip.this.setTitle("Compressing.....");
                    encoder.encode();
                    JOptionPane.showMessageDialog(null, encoder.getSummary(),
                            "Summary", JOptionPane.INFORMATION_MESSAGE);
                    HuffmanZip.this.setTitle("Zip utility v1.1");
                }
            }
        });

        dec.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                int f = choose.showOpenDialog(HuffmanZip.this);
                if (f == JFileChooser.APPROVE_OPTION) {
                    input2 = choose.getSelectedFile();
                    decoder = new Decoder(input2);
                    HuffmanZip.this.setTitle("Decompressing.....");
                    decoder.decode();
                    JOptionPane.showMessageDialog(null, decoder.getSummary(),
                            "Summary", JOptionPane.INFORMATION_MESSAGE);
                    HuffmanZip.this.setTitle("Zip utility v1.1");
                }
            }
        });

        // The progress bar is never instantiated, so it is not added here.
        c.add(dec, BorderLayout.EAST);
        c.add(enc, BorderLayout.WEST);
        c.add(center, BorderLayout.CENTER);
        c.add(title, BorderLayout.NORTH);
        setSize(250, 80);
        setVisible(true);
    }

    public static void main(String args[]) {
        HuffmanZip g = new HuffmanZip();
        g.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }
}
Encoder.java
import java.io.*;
import javax.swing.*;

public class Encoder {
    private static String code[], summary = "";
    private int totalBytes = 0;
    private int count = 0;
    private File inputFile;
    private File outputFile;
    private FileOutputStream C;
    private ObjectOutputStream outF;
    private BufferedOutputStream outf;
    private FileInputStream in1;
    private BufferedInputStream in;
    private boolean done = false;

    public Encoder(File inputFile) {
        this.inputFile = inputFile;
    }

    public void encode() {
        int freq[] = new int[256];
        for (int i = 0; i < 256; i++) {
            freq[i] = 0;
        }
        try {
            in1 = new FileInputStream(inputFile);
            in = new BufferedInputStream(in1);
        } catch (Exception eee) {
        }
        try {
            // First pass: count the frequency of every byte value.
            totalBytes = in.available();
            int mycount = 0;
            in.mark(totalBytes);
            while (mycount < totalBytes) {
                int a = in.read();
                mycount++;
                freq[a]++;
            }
            in.reset();
        } catch (IOException eofexc) {
            System.out.println("error");
        }

        HuffmanNode tree = new HuffmanNode(), one, two;
        PriorityQueue q = new PriorityQueue();
        try {
            // Insert a leaf node for every byte value that occurs in the file.
            for (int j = 0; j < 256; j++) {
                if (freq[j] > 0) {
                    HuffmanNode t = new HuffmanNode("dipu", freq[j], j, null, null, null);
                    q.insertM(t);
                }
            }
            // Build the Huffman tree by repeatedly merging the two
            // lowest-frequency nodes.
            while (q.sizeQ() > 1) {
                one = q.removeFirst();
                two = q.removeFirst();
                int f1 = one.getFreq();
                int f2 = two.getFreq();
                if (f1 > f2) {
                    HuffmanNode t = new HuffmanNode(null, (f1 + f2), 0, two, one, null);
                    one.up = t;
                    two.up = t;
                    q.insertM(t);
                } else {
                    HuffmanNode t = new HuffmanNode(null, (f1 + f2), 0, one, two, null);
                    one.up = t;
                    two.up = t;
                    q.insertM(t);
                }
            }
            tree = q.removeFirst();
        } catch (Exception e) {
            System.out.println("Priority Queue error");
        }

        // Derive the bit string for each byte value from the tree.
        code = new String[256];
        for (int i = 0; i < 256; i++)
            code[i] = "";
        traverse(tree);

        // Record the frequency table so the decoder can rebuild the tree.
        Table rec = new Table(totalBytes, inputFile.getName());
        for (int i = 0; i < 256; i++) {
            rec.push(freq[i]);
        }

        int csize = 0;
        int recordLast = 0;
        try {
            outputFile = new File(inputFile.getName() + ".hff");
            C = new FileOutputStream(outputFile);
            outF = new ObjectOutputStream(C);
            outf = new BufferedOutputStream(C);
            outF.writeObject(rec);
            // Second pass: emit the code bits, packed eight to a byte.
            String outbyte = "";
            while (count < totalBytes) {
                outbyte += code[in.read()];
                count++;
                if (outbyte.length() >= 8) {
                    int k = toInt(outbyte.substring(0, 8));
                    csize++;
                    outf.write(k);
                    outbyte = outbyte.substring(8);
                }
            }
            while (outbyte.length() > 8) {
                csize++;
                int k = toInt(outbyte.substring(0, 8));
                outf.write(k);
                outbyte = outbyte.substring(8);
            }
            // Pad the final partial byte with zeros and record how many
            // bits of it are real.
            if ((recordLast = outbyte.length()) > 0) {
                while (outbyte.length() < 8)
                    outbyte += 0;
                outf.write(toInt(outbyte));
                csize++;
            }
            outf.write(recordLast);
            outf.close();
        } catch (Exception re) {
            System.out.println("Error in writing....");
        }
        float ff = (float) csize / ((float) totalBytes);
        System.out.println("Compression " + recordLast + " ratio " + csize
                + " " + (ff * 100) + " %");
        summary += "File name : " + inputFile.getName() + "\n";
        summary += "File size : " + totalBytes + " bytes.\n";
        summary += "Compressed size : " + csize + " bytes.\n";
        summary += "Compression ratio: " + (ff * 100) + " %\n";
        done = true;
    }

    // Walk the tree; at each leaf, read the path back to the root to
    // build that byte's codeword (0 = left child, 1 = right child).
    private void traverse(HuffmanNode n) {
        if (n.lchild == null && n.rchild == null) {
            HuffmanNode m = n;
            int arr[] = new int[20], p = 0;
            while (true) {
                if (m.up.lchild == m) {
                    arr[p] = 0;
                } else {
                    arr[p] = 1;
                }
                p++;
                m = m.up;
                if (m.up == null)
                    break;
            }
            for (int j = p - 1; j >= 0; j--)
                code[n.getValue()] += arr[j];
        }
        if (n.lchild != null)
            traverse(n.lchild);
        if (n.rchild != null)
            traverse(n.rchild);
    }

    private String toBinary(int b) {
        int arr[] = new int[8];
        String s = "";
        for (int i = 0; i < 8; i++) {
            arr[i] = b % 2;
            b = b / 2;
        }
        for (int i = 7; i >= 0; i--) {
            s += arr[i];
        }
        return s;
    }

    private int toInt(String b) {
        int output = 0, wg = 128;
        for (int i = 0; i < 8; i++) {
            output += wg * Integer.parseInt("" + b.charAt(i));
            wg /= 2;
        }
        return output;
    }

    public int lengthOftask() { return totalBytes; }

    public int getCurrent() { return count; }

    public String getSummary() {
        String temp = summary;
        summary = "";
        return temp;
    }

    public boolean isDone() { return done; }
}
Decoder.java
import java.io.*;
import javax.swing.*;

public class Decoder {
    private int totalBytes = 0, mycount = 0;
    private int freq[], arr = 0;
    private String summary = "";
    private File inputFile;
    private Table table;
    private FileInputStream in1;
    private ObjectInputStream inF;
    private BufferedInputStream in;
    private File outputFile;
    private FileOutputStream outf;

    public Decoder(File file) {
        inputFile = file;
    }

    public void decode() {
        freq = new int[256];
        for (int i = 0; i < 256; i++) {
            freq[i] = 0;
        }
        try {
            in1 = new FileInputStream(inputFile);
            inF = new ObjectInputStream(in1);
            in = new BufferedInputStream(in1);
            table = (Table) (inF.readObject());
            outputFile = new File(table.fileName());
            outf = new FileOutputStream(outputFile);
            summary += "File name : " + table.fileName() + "\n";
        } catch (Exception exc) {
            System.out.println("Error creating file");
            JOptionPane.showMessageDialog(null,
                    "Error\nNot a valid < hff > format file.",
                    "Summary", JOptionPane.INFORMATION_MESSAGE);
            System.exit(0);
        }

        HuffmanNode tree = new HuffmanNode(), one, two;
        PriorityQueue q = new PriorityQueue();
        try {
            // Rebuild the same Huffman tree from the frequency table stored
            // in the compressed file.
            for (int j = 0; j < 256; j++) {
                int r = table.pop();
                if (r > 0) {
                    HuffmanNode t = new HuffmanNode("dipu", r, j, null, null, null);
                    q.insertM(t);
                }
            }
            while (q.sizeQ() > 1) {
                one = q.removeFirst();
                two = q.removeFirst();
                int f1 = one.getFreq();
                int f2 = two.getFreq();
                if (f1 > f2) {
                    HuffmanNode t = new HuffmanNode(null, (f1 + f2), 0, two, one, null);
                    one.up = t;
                    two.up = t;
                    q.insertM(t);
                } else {
                    HuffmanNode t = new HuffmanNode(null, (f1 + f2), 0, one, two, null);
                    one.up = t;
                    two.up = t;
                    q.insertM(t);
                }
            }
            tree = q.removeFirst();
        } catch (Exception exc) {
            System.out.println("Priority queue exception");
        }

        String s = "";
        try {
            mycount = in.available();
            while (totalBytes < mycount) {
                arr = in.read();
                s += toBinary(arr);
                while (s.length() > 32) {
                    for (int a = 0; a < 32; a++) {
                        int wr = getCode(tree, s.substring(0, a + 1));
                        if (wr == -1)
                            continue;
                        else {
                            outf.write(wr);
                            s = s.substring(a + 1);
                            break;
                        }
                    }
                }
                totalBytes++;
            }
            // Drop the trailing count byte, then the padding bits of the
            // last data byte (arr holds the count of valid bits).
            s = s.substring(0, (s.length() - 8));
            s = s.substring(0, (s.length() - 8 + arr));
            int counter;
            while (s.length() > 0) {
                if (s.length() > 16)
                    counter = 16;
                else
                    counter = s.length();
                for (int a = 0; a < counter; a++) {
                    int wr = getCode(tree, s.substring(0, a + 1));
                    if (wr == -1)
                        continue;
                    else {
                        outf.write(wr);
                        s = s.substring(a + 1);
                        break;
                    }
                }
            }
            outf.close();
        } catch (IOException eofexc) {
            System.out.println("IO error");
        }
        summary += "Compressed size : " + mycount + " bytes.\n";
        summary += "Size after decompressed : " + table.originalSize() + " bytes.\n";
    }

    // Follow the bit string down the tree; return the byte value at a
    // leaf, or -1 if the bits do not yet reach a leaf.
    private int getCode(HuffmanNode node, String decode) {
        while (true) {
            if (decode.charAt(0) == '0') {
                node = node.lchild;
            } else {
                node = node.rchild;
            }
            if (node.lchild == null && node.rchild == null) {
                return node.getValue();
            }
            if (decode.length() == 1)
                break;
            decode = decode.substring(1);
        }
        return -1;
    }

    public String toBinary(int b) {
        int arr[] = new int[8];
        String s = "";
        for (int i = 0; i < 8; i++) {
            arr[i] = b % 2;
            b = b / 2;
        }
        for (int i = 7; i >= 0; i--) {
            s += arr[i];
        }
        return s;
    }

    public int toInt(String b) {
        int output = 0, wg = 128;
        for (int i = 0; i < 8; i++) {
            output += wg * Integer.parseInt("" + b.charAt(i));
            wg /= 2;
        }
        return output;
    }

    public int getCurrent() { return totalBytes; }

    public int lengthOftask() { return mycount; }

    public String getSummary() { return summary; }
}
DLNode.java
public class DLNode {
    private DLNode next, prev;
    private HuffmanNode elem;

    public DLNode() {
        next = null; prev = null; elem = null;
    }

    public DLNode(DLNode next, DLNode prev, HuffmanNode elem) {
        this.next = next; this.prev = prev; this.elem = elem;
    }

    public DLNode getNext() { return next; }
    public DLNode getPrev() { return prev; }
    public void setNext(DLNode n) { next = n; }
    public void setPrev(DLNode n) { prev = n; }
    public void setElement(HuffmanNode o) { elem = o; }
    public HuffmanNode getElement() { return elem; }
}
HuffmanNode.java
import java.io.*;

public class HuffmanNode implements Serializable {
    public HuffmanNode rchild, lchild, up;
    private String code;
    private int freq;
    private int value;

    public HuffmanNode(String bstring, int freq, int value,
            HuffmanNode lchild, HuffmanNode rchild, HuffmanNode up) {
        code = bstring; this.freq = freq; this.value = value;
        this.lchild = lchild; this.rchild = rchild; this.up = up;
    }

    public HuffmanNode() {
        code = ""; freq = 0; value = 0; lchild = null; rchild = null;
    }

    public int getFreq() { return freq; }
    public int getValue() { return value; }
    public String getCode() { return code; }
}
PriorityQueue.java
public class PriorityQueue {
    private DLNode head, tail;
    private int size = 0;
    private int capacity;
    private HuffmanNode obj[];

    public PriorityQueue(int cap) {
        head = new DLNode();
        tail = new DLNode();
        head.setNext(tail);
        tail.setPrev(head);
        capacity = cap;
        obj = new HuffmanNode[capacity];
    }

    public PriorityQueue() {
        this(1000);
    }

    // Insert a node so the list stays sorted by ascending frequency.
    public void insertM(HuffmanNode o) throws Exception {
        if (size == capacity) throw new Exception("Queue is full");
        if (head.getNext() == tail) {
            DLNode d = new DLNode(tail, head, o);
            head.setNext(d);
            tail.setPrev(d);
        } else {
            DLNode n = head.getNext();
            int key = o.getFreq();
            while (true) {
                if (n.getElement().getFreq() > key) {
                    DLNode second = n.getPrev();
                    DLNode huf = new DLNode(n, second, o);
                    second.setNext(huf);
                    n.setPrev(huf);
                    break;
                }
                if (n.getNext() == tail) {
                    DLNode huf = new DLNode(tail, n, o);
                    n.setNext(huf);
                    tail.setPrev(huf);
                    break;
                }
                n = n.getNext();
            }
        }
        size++;
    }

    public HuffmanNode removeFirst() throws Exception {
        if (isEmpty()) throw new Exception("Queue is empty");
        HuffmanNode o = head.getNext().getElement();
        DLNode sec = head.getNext().getNext();
        head.setNext(sec);
        sec.setPrev(head);
        size--;
        return o;
    }

    public HuffmanNode removeLast() throws Exception {
        if (isEmpty()) throw new Exception("Queue is empty");
        DLNode d = tail.getPrev();
        HuffmanNode o = d.getElement();
        tail.setPrev(d.getPrev());
        d.getPrev().setNext(tail);
        size--;
        return o;
    }

    public boolean isEmpty() { return size == 0; }

    public int sizeQ() { return size; }

    public HuffmanNode first() throws Exception {
        if (isEmpty()) throw new Exception("Queue is empty");
        return head.getNext().getElement();
    }

    public HuffmanNode Last() throws Exception {
        if (isEmpty()) throw new Exception("Queue is empty");
        return tail.getPrev().getElement();
    }
}
Table.java
import java.io.*;

class Table implements Serializable {
    private String FileName;
    private int fileSize, arr[], size = 0, front = 0;

    public Table(int fileSize, String FileName) {
        arr = new int[256];
        this.FileName = FileName;
        this.fileSize = fileSize;
    }

    public void push(int c) {
        if (size >= 256) System.out.println("Error in record");
        arr[size] = c;
        size++;
    }

    public int originalSize() { return fileSize; }

    public int pop() {
        if (size < 1) System.out.println("Error in record");
        int rt = arr[front++];
        size--;
        return rt;
    }

    public String fileName() { return FileName; }

    public int recSize() { return size; }
}
REFERENCES
    TITLE                               AUTHOR
 1. Data Compression                    Khalid Sayood
 2. Data Compression                    Mark Nelson
 3. Foundations of I.T.                 D. S. Yadav
 4. Java: The Complete Reference        Herbert Schildt
 5. OOPS in Java                        E. Balagurusamy
 6. Java Programming                    Krishnamoorthy
 7. Software Engineering                Pressman
 8. Software Engineering                Pankaj Jalote

WEBSITES:
1. http://www.google.com
2. http://www.wikipedia.org
3. http://www.nist.gov

ENCLOSED:
Soft copy of the project on CD.