Steganographic Method for Data Hiding in Microsoft Word Documents structure by a Change Tracking Technique
Thesis · May 2009
2 authors, including: Abdul Monem S. Rahma, University of Technology, Iraq
Source: https://www.researchgate.net/publication/311969898
List of Figures
1.1 Information Hiding Hierarchy
1.2 Generic digital watermarking scheme
1.3 Watermarking example
1.4 A data hiding example
2.1 Steganography basic model
2.2 Steganography Types
2.3 Text Hiding methods
2.4 Color quantization
2.5 Halftone quantization
2.6 Huffman Tree for example
2.7 Huffman tree for the 26-letter Alphabet
3.1 Word Versions for Different Operating System
3.2 External Structure of a Word Document
3.3 Track Change Example
3.4 Comments Example
3.5 File Structure Types
3.6 Logic view of file
3.7 Storage and Streams structure
3.8 Sample Word document storage format
3.9 The structure of Hard Disk
3.10 MS Compound files structure
3.11 Word Object Model
3.12 Platform Invokes call to an unmanaged Dll Function
4.1 Block Diagram for Proposed System
4.2 Screenshot of Microsoft Word in case of collaborative document authoring
4.3 Author A sends a stegodocument S to a recipient B
4.4 Hiding Algorithm Flowchart
4.5 Search Unused Block Algorithm Flowchart
4.6 Extracting Algorithm Flowchart
5.1 Word Reference
5.2 Block diagram for Unused Block path in Document file
5.3 The main menu for the proposed system
5.4 Cover Document before Track Change
5.5 Cover Document after Track change
5.6 The Embedding Process Window
5.7 Document after Hiding
5.8 Extracting Process Window
List of Tables
Table No. Description
2.1 Steganography Attacks
2.2 Probabilities of occurrence in English language
3.1 MCDFF Metadata
3.2 Compound document header structure
3.3 Header (block1): 512 (0x200) bytes
3.4 Directory entry structure
3.5 Property: 128 (0x80) byte block
3.6 Block Allocation Table
3.7 Office 2003 applications and component type libraries
5.1 Comparisons between the proposed system and other text hiding methods
Glossary
Term: Description
Byte order: The order in which single bytes of a bigger data type are represented or stored.
Compound document: File format used to store several objects in a single file; objects can be organized hierarchically in storages and streams.
Compound document header: Structure in a compound document containing initial settings.
Control stream: Stream in a compound document containing internal control data.
Directory: List of directory entries for all storages and streams in a compound document.
Directory entry: Part of the directory containing relevant data for a storage or a stream.
Directory entry identifier (DirID): Zero-based index of a directory entry.
Directory stream: Sector chain containing the directory.
DirID: Zero-based index of a directory entry.
End Of Chain SecID: Special sector identifier used to indicate the end of a SecID chain.
File offset: Physical position in a file.
Free SecID: Special sector identifier for unused sectors.
Header: Short for "compound document header".
Master sector allocation table (MSAT): SecID chain containing sector identifiers of all sectors used by the sector allocation table.
MSAT SecID: Special sector identifier used to indicate that a sector is part of the master sector allocation table.
Red-black tree: Tree structure used to organise direct members of a storage.
Root storage: Built-in storage that contains all other objects (storages and streams) in a compound document.
Root storage entry: Directory entry representing the root storage.
SecID: Zero-based index of a sector (short for "sector identifier").
SecID chain: An array of sector identifiers (SecIDs) specifying the sectors that are part of a sector chain; it thus enumerates all sectors used by a stream.
Sector: Part of a compound document with fixed size that contains any kind of stream (user stream or control stream) data.
No. Subject
1 Chapter One: General Introduction and Survey
1.1 Introduction
1.2 Information Hiding History
1.3 Information Hiding Hierarchy
1.4 The Difference between Cryptography, Steganography and Watermarking
1.5 Information Hiding Applications
1.6 Literature Survey
1.7 Aim of Thesis
1.8 Thesis Outlines
2 Chapter Two: Steganography
2.1 Introduction
2.2 Steganography Basic Model
2.3 Steganography Types
2.3.1 Pure Steganography
2.3.2 Secret Key Steganography
2.3.3 Public Key Steganography
2.4 Steganography Algorithms
2.4.1 Spatial Domain Based Steganography
2.4.2 Transform Domain Based Steganography
2.4.3 Document Based Steganography
2.4.4 File Structure Based Steganography
2.5 Steganography Under Various Media
2.5.1 Hiding in Disk Space
2.5.2 Hiding in Network Packets
2.5.3 Hiding in Software and Circuitry
2.5.4 Hiding in Video
2.5.5 Hiding in Audio
2.5.6 Hiding in Image
2.5.7 Hiding in Text
2.6 Classification of Text Hiding Techniques
2.7 Steganalysis
2.8 Attacks Available to the Steganalyst
2.9 Introduction to the Code
2.10 Why Encode the Data
2.11 Huffman Coding
3 Chapter Three: Microsoft Word Document File
3.1 Introduction
3.2 History of Word
3.3 Microsoft Word Document and its Components
3.4 Annotation and Collaboration Tools
3.4.1 Track Changes
3.4.2 Comments
3.5 File Format
3.6 Identify the Type of a File
3.6.1 Filename Extension
3.6.2 Magic Number
3.7 File Structure
3.7.1 Raw Memory Dumps/Unstructured Formats (RMD)
3.7.2 Chunk Based Formats (CBF)
3.7.3 Directory Based Formats (DBF)
3.8 Structure Storage
3.9 Microsoft Compound Document File Format (MCDFF)
3.10 Structure of Word Document Files
3.11 Format of the Main Stream
3.12 MCDFF Metadata
3.12.1 Compound Document Header
3.12.2 Byte Order
3.12.3 Sector File Offset
3.12.4 Property Table (Directory)
3.12.5 Block Allocation Table (BAT)
3.12.6 Sector Allocation Table (SAT)
3.13 Office Automation
3.14 PIA for Microsoft Office 2003
3.15 Word Object Model
3.16 Platform Invoke (PInvoke)
3.17 Application Programming Interface (API)
3.18 Office Application Programming Interface (APIs)
4 Chapter Four: Proposed Hiding System in Document File
4.1 Introduction
4.2 Cover Generation Process
4.3 Embedding Process
5 Chapter Five: Experimental Results and Discussion
5.1 Introduction
5.2 System Implementation
5.2.1 Document before Hiding
5.2.2 Embedding Process
5.2.3 Document after Hiding
5.2.4 Extracting Process
5.3 Comparisons between proposed system and the most popular hiding methods
6 Chapter Six: Conclusions and Suggestions for Future Work
6.1 Conclusions
6.2 Suggestions for Future Work
Glossary
References
I Appendix A
II Appendix B
III Appendix C
Hiding messages is nothing new; over the years, a multitude of methods have been used to hide information. One of the first documents describing steganography is the Histories of Herodotus. In ancient Greece, text was written on wax-covered tablets. To avoid capture, the sender of one message scraped the wax off the tablets and wrote the message on the underlying wood. He then covered the tablets with wax again. The tablets appeared to be blank and unused, so they passed inspection by sentries without question [JOH99].
Historically, various steganographic techniques have been used, including:
I. Tattoo. A Roman general shaved the head of a slave and tattooed a message on his scalp. When the slave's hair grew back, the general dispatched the slave to deliver the hidden message to its intended recipient [DIC07].
II. Character marking. Selected letters of printed or typewritten text are overwritten in pencil. The marks are ordinarily not visible unless the paper is held at an angle to bright light [DOB97].
III. Invisible ink. From the 1st century through World War II, invisible inks were often used to conceal hidden messages. A number of substances (milk, vinegar, fruit juices and urine) can be used for writing. They leave no visible trace until heat or some chemical is applied to the paper.
IV. Pin punctures. Small pin punctures on selected letters are ordinarily not visible unless the paper is held up in front of a light [DOB97].
V. Microfilm. While Paris was under siege in 1870, messages were sent by carrier pigeon. A Parisian photographer used a microfilm technique to enable each pigeon to carry a higher volume of data [DIC07].
VI. Null ciphers (unencrypted messages). In this method, the first letter of each word spells out a message. Such messages are very hard to construct [KAH96].
The following message was actually sent by a German spy during the Second World War [RIM97]:
"Apparently neutral's protest is thoroughly discounted and ignored. Isman hard hit. Blockade issue affects pretext for embargo on by-products, ejecting suets and vegetable oils".
Taking the second letter of each word reveals the following secret message:
"Pershing sails from NY June 1".
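The second-letter decoding rule described for the spy message can be checked mechanically. The sketch below is illustrative only: the function name is hypothetical, and the cover text is assumed to be the commonly quoted wording of the message, which includes the word "protest". The final letter "i" of the output is read as the digit 1.

```python
def second_letters(text: str) -> str:
    """Return the second character of every whitespace-separated word."""
    return "".join(word[1].lower() for word in text.split() if len(word) > 1)


# Cover text as commonly quoted (assumption: includes the word "protest").
cover = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo "
    "on by-products, ejecting suets and vegetable oils."
)

hidden = second_letters(cover)
print(hidden)  # pershingsailsfromnyjunei
```

Word boundaries and the reading of the trailing "i" as "1" are left to the human recipient, which is part of why such messages are hard to construct.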
1.3 Information Hiding Hierarchy
Information Hiding (IH) is a technique in the area of information security. It secretly embeds information into digital content such as images, audio, movies and documents, so that it cannot be visually or audibly perceived; a data hiding example is shown in figure (1.4) [YOS06].
The terminology agreed at the first international workshop on this subject is shown in Figure (1.1) [CAC98]:
Covert channels, in the context of multilevel secure systems (e.g. military computer systems), are communication paths that were neither designed nor intended to transfer information at all. These channels are typically used by untrustworthy programs to leak information to their owner while performing a service for another program [KAT00].
Anonymity is finding ways to hide the metacontent of messages, that is, the sender and the recipients of a message [KAT00].
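Figure (1.1) Information hiding hierarchy [figure: IH divides into covert channels, steganography, anonymity and copyright marking; copyright marking divides into robust copyright marking and fragile watermarking; robust copyright marking covers fingerprinting and watermarking]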
Steganography, an important subdiscipline of information hiding, is the art and science of communicating in a way which hides the existence of the communication [KAT00].
Fingerprinting is a term that denotes special applications of watermarking. It relates to watermarking applications in which information such as the creator or recipient of digital data is embedded as watermarks [KAT00].
In contrast to steganography, copyright marking guarantees that embedded data can be reliably detected after the image has been modified (but not destroyed beyond recognition) [CAC98].
Watermarking is the process of embedding information into digital multimedia content such that the information (which we call the watermark) can later be extracted or detected for a variety of purposes including copy prevention and control; an example of watermarking is shown in figure (1.3) [BAK05].
Figure (1.2) Generic digital watermarking scheme [KAT00] [figure labels: watermark, host data, watermark data, secret/public key (K), marking algorithm]
There are several approaches to classifying watermarking systems. One could categorize them according to the robustness of the watermark against types of attack.
Fragile watermarks are watermarks that have only very limited robustness. The embedded watermarks will change, or disappear, if a watermarked object is altered. This type of watermark can be used for authentication purposes, to verify the originality of the watermarked object [BAK05].
Robust watermarking is designed to survive "moderate to severe signal processing attacks", in such a way that any signal transform of reasonable strength cannot remove the watermark. Robust watermarks are applicable in image copyright protection and fingerprinting [BAK05].
Figure (1.3) watermarking example [ROC08]
1.4 The Differences between Cryptography, Steganography and Watermarking
The cryptographer's interest is primarily in obscuring the content of a message, but not the communication of the message. The steganographer, on the other hand, is concerned with hiding the very communication of the message, while the digital watermarker attempts to add sufficient metadata to a message to establish ownership, provenance, source, etc. Cryptography and steganography share the feature that the object of interest is embedded, hidden or obscured, whereas the object of interest in watermarking is the host or carrier which is being protected by the object that is embedded, hidden or obscured. Further, watermarking and steganography may be used with or without cryptography; and imperceptible watermarking shares functionality with steganography, whereas perceptible watermarking does not [BER06].
1.5 Information Hiding Applications [XIU06]
The advantages of information hiding technology have been applied in many prospects, including e-commerce, electronic transaction protection, [...]
Computer-based steganographic techniques introduce changes to digital carriers to embed information foreign to the native [...]
Carriers of such messages may resemble innocent-sounding text, disks and storage devices, network traffic and protocols, the way software or circuits are arranged, audio, images, video, or any other digitally represented code or transmission [JOH01].
2.5.1 Hiding in Disk Space [MIK07]
Another way to hide information relies on finding unused space that is not readily apparent to an observer. [...] information without perceptually degrading the carrier. The way operating systems store files typically results in unused space that appears to be allocated to files. Another method of hiding information in a file system is to create a hidden partition. These partitions are not seen if the system is started normally. However, in many cases, running a disk configuration utility exposes the hidden partition. These concepts have been expanded in a novel proposal of a steganographic file system. If the user knows the file name and password, then access is granted to the file; otherwise, no evidence of the file exists in the system of the hidden files.
2.5.2 Hiding in Network Packets [JOH01]
Various network protocols have characteristics that can be used to hide information. TCP/IP packets are used to transport information; an uncountable number of packets are transmitted daily over the Internet. Any of these packets can provide a covert communication channel. The packet headers have unused space or other values that can be manipulated to hide information. However, filters can be set to detect information in the "unused" or reserved spaces. One way to circumvent this detection is to take advantage of information in the headers that typically goes unchecked by most systems. Such information includes the values for sequence and identification numbers.
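As a rough illustration of the identification-number idea, the sketch below packs two message bytes into the 16-bit Identification field of each IPv4 header and reads them back. It only builds header bytes in memory and sends nothing. The function names, the addresses and the zeroed checksum are assumptions made for the example, not part of the cited work.

```python
import struct


def to_ip_ids(message: bytes) -> list[int]:
    """Split a message into 16-bit values, one per packet's Identification field."""
    padded = message + b"\x00" * (len(message) % 2)  # pad to an even length
    return [struct.unpack("!H", padded[i:i + 2])[0] for i in range(0, len(padded), 2)]


def from_ip_ids(ids: list[int]) -> bytes:
    """Rebuild the message; trailing zero padding is dropped."""
    return b"".join(struct.pack("!H", v) for v in ids).rstrip(b"\x00")


def ipv4_header(ident: int) -> bytes:
    """Build a 20-byte IPv4 header (checksum left at 0) carrying `ident`."""
    version_ihl = (4 << 4) | 5  # IPv4, header length of 5 32-bit words
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, 20,      # version/IHL, type of service, total length
        ident, 0,                # Identification, flags and fragment offset
        64, 6, 0,                # TTL, protocol (TCP), checksum placeholder
        bytes([10, 0, 0, 1]),    # hypothetical source address
        bytes([10, 0, 0, 2]),    # hypothetical destination address
    )


ids = to_ip_ids(b"hi!")
headers = [ipv4_header(i) for i in ids]
recovered = [struct.unpack("!H", h[4:6])[0] for h in headers]  # bytes 4-5 hold the ID
print(from_ip_ids(recovered))  # b'hi!'
```

Because padding is stripped on extraction, a message that itself ends in zero bytes would be truncated; a real scheme would need a length field or another framing convention.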
2.5.3 Hiding in Software and Circuitry
Data can also be hidden based on the physical arrangement of a carrier. The arrangement itself may be an embedded signature that is unique to the creator. An example of this is in the layout of code distributed in a program or the layout of electronic circuits on a board; this type of "marking" can be used to uniquely identify the design origin and cannot be removed without significant change to the network [JOH01].
2.5.4 Hiding in Video
For video, a combination of sound and image techniques can be used. This is due to the fact that video generally has separate inner files for the video (consisting of many images) and the sound. So techniques can be applied in both areas to hide data. Due to the size of video files, the scope for adding lots of data is much greater and therefore the chance of hidden data being detected is quite low [CUM04].
2.5.5 Hiding in Audio
Data hiding in audio signals is especially challenging, because the Human Auditory System (HAS) operates over a wide dynamic range. To put this in perspective, the HAS perceives over a range of power greater than one million to one and a range of frequencies greater than one thousand to one, making it extremely hard to add or remove data from the original data structure. The only weakness in the HAS comes in trying to differentiate sounds (loud sounds drown out quiet sounds), and this is what must be exploited to encode secret messages in audio without being detected [DUN02].
2.5.6 Hiding in Image
Given the proliferation of digital images, especially on the Internet, and given the large amount of redundant bits present in the digital representation of an image, images are the most popular cover objects for steganography [MOR00].
Using image files as hosts for steganographic messages takes advantage of the limited capabilities of the human visual system. Encoding extra data in an image file changes pixels in the image, but these changes would remain imperceptible to the human eye [BER05].
2.5.7 Hiding in Text
Written text can be used as a method to transmit secret messages. Only small amounts of data can be hidden when hiding data in text. Thus, this method is known to have a low data rate.
An important point is that the embedding task in text requires the interaction of the user and therefore cannot be automated, while image and audio methods can embed the data directly and automatically according to their algorithm.
2.6 Classification of Text Hiding Techniques
Steganography methods can try to encode the information directly in the text or in the text format, as shown in figure (2.3).
I. Encoding Information Directly in the Text
Many ways have been proposed to hide information directly in text, such as Syntactic, Semantic, P. Wayner, Chapman, Translation and HTML.
Syntactic method: the structure of sentences is transformed without significantly altering their meaning. This method utilizes punctuation and diction [VIL06].
Example of using punctuation:
The phrases "bread, butter, and milk" and "bread, butter and milk" are both considered correct usage of commas in a list, such that when the comma appears before the "and" this represents a "1", and the second phrase represents a "0" [ALS01].
Example of using diction and structure of the text:
The sentence "Before the night is over, I will finish" and
the sentence "I will finish before the night is over".
This method is more transparent than the punctuation method. When a verb comes at the beginning of the sentence, this is encoded as a "1"; when an adverbial comes at the beginning of the sentence, this is encoded as a "0" [ALS01]. The expected data rate is only several bits per kilobyte of text. The use of punctuation is noticeable to even a casual reader, and changing the punctuation will impact the clarity and even the meaning of the text, which can be considered a disadvantage of using punctuation.
Semantics Method
Words are replaced by their synonyms and/or sentences are transformed via suppression or inclusion of noun phrase coreferences [VIL06].
Example of using the semantic method:
The word "big" could be considered primary and the word "large" is considered secondary. When decoding, primary words are read as ones and secondary words as zeros [ALS01].
However, syntactic and semantic methods are not suitable for all types of documents (e.g. contracts, identity documents, literary texts) and need, in general, human supervision [VIL06].
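The primary/secondary synonym rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: only the pair "big"/"large" comes from the text, the second pair and the function names are hypothetical, and the receiver is assumed to know how many bits were embedded, because synonym words left over after the message also decode as bits.

```python
# (primary, secondary) pairs; only ("big", "large") is taken from the text.
SYNONYMS = [("big", "large"), ("quick", "fast")]
PRIMARY = {p: s for p, s in SYNONYMS}     # primary word -> its secondary synonym
SECONDARY = {s: p for p, s in SYNONYMS}   # secondary word -> its primary synonym


def embed(words, bits):
    """Rewrite each synonym word so that primary encodes 1 and secondary encodes 0."""
    remaining = iter(bits)
    out = []
    for w in words:
        if w in PRIMARY or w in SECONDARY:
            primary = w if w in PRIMARY else SECONDARY[w]
            bit = next(remaining, None)
            if bit is None:
                out.append(w)  # message exhausted: leave the word unchanged
            else:
                out.append(primary if bit == 1 else PRIMARY[primary])
        else:
            out.append(w)
    return out


def extract(words):
    """Read 1 for every primary word and 0 for every secondary word."""
    return [1 if w in PRIMARY else 0 for w in words if w in PRIMARY or w in SECONDARY]


cover = "a big dog made a quick jump over a large fence".split()
stego = embed(cover, [0, 1, 1])
print(" ".join(stego))
print(extract(stego))
```

The example also shows why human supervision is needed: swapping "big" for "large" is harmless here, but a dictionary-driven swap can easily produce an unnatural phrase in other contexts.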
P. Wayner Method
Peter Wayner proposed a mimic function which exploits the statistical profile of a message. Since the stego-objects are created only according to a statistical profile, the semantic component is entirely ignored.
Wayner described one of the most promising techniques: he uses a context-free grammar (CFG) to create the cover text and chooses the productions according to the secret message to be transmitted. The secret information is not embedded in the cover; the cover itself is the secret message. If the grammar is unambiguous, the receiver can extract the information by applying standard parsing techniques [KAT00].
Wayner proposed an extension to the technique of mimic functions: given a set of productions, a probability is assigned to each possible production. The sender then constructs a Huffman compression function and converts the secret message to binary bits. The receiver then parses the cover in order to reconstruct the productions which have been used in the embedding step; this can be accomplished by the use of a parse tree for the given CFG [ALS01].
The weakness of this technique is that it is difficult to select meaningful type categories without considering the eventual grammatical requirements of a natural-language style source [ALD05].
Chapman and Davida Method
Chapman and Davida proposed a system which consists of two functions, NICETEXT and SCRAMBLE. Given a large dictionary of words of different types, and a style source, which describes how words of different types can be used to form a meaningful sentence, NICETEXT transforms secret message bits into sentences by selecting words out of the dictionary which conform to a sentence structure given in the style source [ALS01].
SCRAMBLE reconstructs the secret if the dictionary which has been used is known. Style sources can either be created from natural-language sentences or be generated using a CFG [ALS01].
The most obvious problem with the manual method is that it takes too long to enter large lists. NICETEXT focuses on creating large, sophisticated dictionaries with thousands of words [ALD05].
Translation-based Steganography
This method uses the expected errors in the translation process, especially in machine translation, to solve the issue of producing implausible text; information is hidden in the noise that occurs in language translation. In cases where sending imperfect translations to a [...] resulting from translation-based steganography are inconspicuous. The translation-based approach, however, may be vulnerable to active attacks [LIU07].
HTML
Information is hidden in HTML files by adding useless spaces and line breaks or by changing the case of letters in the tags [JOH98].
HTML files are good candidates for including extra spaces, because Web browsers ignore these "extra" spaces and they go unnoticed until the source of the page is revealed [KAT00].
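The tag-case variant of the HTML method can be sketched as follows. This is an illustrative convention rather than the scheme of the cited works: an upper-case tag name is assumed to stand for "1" and a lower-case one for "0", each opening or closing tag carries one bit, and the receiver is assumed to know the number of embedded bits. The function names are hypothetical.

```python
import re

# Matches the start of an opening or closing tag and captures the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def embed(html, bits):
    """Set the case of successive tag names: upper case = 1, lower case = 0."""
    remaining = iter(bits)

    def recase(match):
        bit = next(remaining, None)
        if bit is None:
            return match.group(0)  # message exhausted: keep the tag as it is
        name = match.group(2).upper() if bit else match.group(2).lower()
        return "<" + match.group(1) + name

    return TAG.sub(recase, html)


def extract(html, n):
    """Read the first n tag names back as bits."""
    names = [m.group(2) for m in TAG.finditer(html)]
    return [1 if name.isupper() else 0 for name in names[:n]]


page = "<html><body><p>Hello</p></body></html>"
stego = embed(page, [1, 0, 1, 1, 0, 0])
print(stego)               # <HTML><body><P>Hello</P></body></html>
print(extract(stego, 6))   # [1, 0, 1, 1, 0, 0]
```

HTML tag names are case-insensitive, so the page renders identically; the mixed casing is only visible when the page source is inspected.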
Figure (2.3) Text hiding methods [figure: text hiding techniques divide into (I) encoding information directly in the text (semantic method, syntax method, P. Wayner method, Chapman and Davida method, translation-based steganography, HTML) and (II) encoding information in the text format (feature encoding, line-shift encoding, word-shift encoding, open-space encoding, color quantization, halftone quantization); each method yields a binary code]
II. Encoding Information in the Text Format [ALS01].
Information can be embedded in the format rather than in the message itself. Secret information can be stored in the size of inter-line or inter-word spaces. If the space between two lines is smaller than some threshold, a "0" is encoded; otherwise a "1" is encoded. Infrequent additional white space characters are introduced to form the secret message.
Open Space method
Encode through manipulation of white space (unused space) on the
printed page. There are three methods for using white space to encode
data.
Inter-Sentence Spacing [ALS01].
This method deals with encoding a binary message into a text by
placing one or two spaces after the sentence, such that one space
represents "0" and two spaces represent "1".
The disadvantage of this method is that it is inefficient, requiring a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text. Its ability to encode also depends on the structure of the text, and many word processors automatically set the number of spaces after periods to one or two characters.
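Inter-sentence spacing is simple enough to sketch directly. The helper names below are hypothetical, and the sketch assumes each sentence ends with ".", "!" or "?" and that gaps without a message bit default to a single space.

```python
import re


def embed(sentences, bits):
    """Join sentences with one space for a 0 bit and two spaces for a 1 bit."""
    out = []
    for i, sentence in enumerate(sentences):
        out.append(sentence)
        if i < len(sentences) - 1:
            bit = bits[i] if i < len(bits) else 0  # no bit left: single space
            out.append("  " if bit else " ")
    return "".join(out)


def extract(text):
    """Read one bit from every gap that follows sentence-ending punctuation."""
    gaps = re.findall(r"(?<=[.!?])( {1,2})(?=\S)", text)
    return [1 if len(gap) == 2 else 0 for gap in gaps]


sentences = ["It rained.", "We stayed in.", "Tea was made.", "Nobody minded."]
stego = embed(sentences, [1, 0, 1])
print(repr(stego))
print(extract(stego))  # [1, 0, 1]
```

Four sentences carry only three bits, which matches the low data rate noted above.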
A. End-of-line spaces [ALS01].
This method deals with inserting spaces at the end of lines. The data are encoded by allowing for a predetermined number of spaces at the end of each line. Its advantages are that it goes unnoticed by readers and that it can hide more information than the inter-sentence method. Its disadvantage is that some programs, such as "sendmail", may inadvertently remove the extra space characters.
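A sketch of end-of-line encoding, assuming a convention chosen for this example: each line carries two bits, stored as zero to three trailing spaces. The function names and the `bits_per_line` parameter are hypothetical.

```python
def embed(lines, bits, bits_per_line=2):
    """Append 0..(2**bits_per_line - 1) trailing spaces to each line."""
    out = []
    for i, line in enumerate(lines):
        chunk = bits[i * bits_per_line:(i + 1) * bits_per_line]
        chunk = chunk + [0] * (bits_per_line - len(chunk))  # pad a short final chunk
        count = int("".join(map(str, chunk)), 2)
        out.append(line.rstrip(" ") + " " * count)
    return out


def extract(lines, bits_per_line=2):
    """Count the trailing spaces of each line and expand the count into bits."""
    bits = []
    for line in lines:
        count = len(line) - len(line.rstrip(" "))
        bits.extend(int(b) for b in format(count, f"0{bits_per_line}b"))
    return bits


cover = ["alpha", "beta", "gamma"]
stego = embed(cover, [1, 0, 1, 1, 0, 1])
print([repr(line) for line in stego])
print(extract(stego))  # [1, 0, 1, 1, 0, 1]
```

Any tool that trims trailing white space, as mentioned above, silently erases the message.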
B. Inter-Word-Spaces [ALS01].
Using white space between words to encode data involves right justification of text. One space between words is interpreted as a "0"; two spaces between words are interpreted as a "1". The advantages of this method are that changing the number of spaces has little chance of changing the meaning of a phrase or sentence, and that the casual reader is unlikely to take notice of slight modifications in white space. The disadvantage is that, even if the reader does not notice the manipulation, a word processor may inadvertently change the number of spaces, destroying the hidden data.
Line-Shift Coding
In this method, text lines are vertically shifted (moved up or down)
according to the secret message bits, whereas other lines are kept
stationary for the purpose of synchronization. If a line is moved up, a
"1" is encoded; otherwise a "0" is encoded [DUC01].
The disadvantages of this method are that it is the text coding technique most visible to the reader, that even large documents encode only a few bits (one bit per line), and that the need for the original document may decrease the security of the system [ALS01].
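Line-shift coding can be modelled on a list of baseline positions. The sketch below is an assumption-laden illustration: every second line carries a bit, the lines in between stay stationary for synchronization, the y coordinate grows downward, a line moved up encodes "1" and a line moved down encodes "0", and the decoder compares each carrier line with the midpoint of its stationary neighbours instead of using the original document. All names and the shift size are hypothetical.

```python
def embed(baselines, bits, shift=1.0):
    """Shift every second line up (bit 1) or down (bit 0); other lines stay fixed."""
    out = list(baselines)
    carriers = range(1, len(baselines) - 1, 2)  # lines with a neighbour on both sides
    for idx, bit in zip(carriers, bits):
        out[idx] += -shift if bit else shift    # y grows downward, so up is negative
    return out


def extract(baselines, n):
    """Compare each carrier line with the midpoint of its stationary neighbours."""
    bits = []
    for idx in list(range(1, len(baselines) - 1, 2))[:n]:
        midpoint = (baselines[idx - 1] + baselines[idx + 1]) / 2
        bits.append(1 if baselines[idx] < midpoint else 0)
    return bits


page = [0, 12, 24, 36, 48, 60, 72]  # evenly spaced baselines, in points
stego = embed(page, [1, 0, 1])
print(stego)               # [0, 11.0, 24, 37.0, 48, 59.0, 72]
print(extract(stego, 3))   # [1, 0, 1]
```

Seven lines carry three bits here, which illustrates the low capacity of the method.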
Word-shift Coding [ALS01]
In this method, codewords are coded into a document by shifting the
horizontal or vertical locations of words within text lines, while
maintaining a natural spacing appearance.
This method is only applicable to documents with variable spacing between adjacent words. As a result of this variable spacing, it is necessary to have the original image, or at least to know the spacing between words in the unencoded document.
A. Encode Codeword (Horizontal Shift- Word)
For each text line, the largest and the smallest spaces between words
are found. It is possible to alter every space between two words
[ALS01].
For example take the Sentence1:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet
Applying some horizontal shifting word algorithm to obtain the
following sentence
Sentence 2:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
By overlapping the two sentences, obtain the following:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and
security in the electronic web called the Internet.
This is achieved by expanding the space before "wide" and "web" by one point and condensing the space after "explore" and "the world" by one point in sentence 1. The sentences containing the shifted words appear harmless, but combining them with the original sentence produces a different message: explore the world wide web.
In the same way, one can encode a binary message instead of a codeword. For example, expanding the space before "explore", "the world", "wide" or "web" by one point encodes a "1", and condensing the space after "explore", "the world", "wide" or "web" by one point encodes a "0".
By applying random horizontal shifts to all words in the document, an attacker could eliminate the encoding.
B. Encode Codeword (Vertical Shift- Word)
Shifting the vertical locations of words can be used to help identify
an original document. A similar method can be applied to display an
entirely different message [ALS01].
For example take the following sentence:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
Applying some vertical shifting word algorithm to obtain the following
sentence:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
In the same way, one can encode a binary message instead of a codeword. For example, shifting up the words "explore" and "the world" encodes a "1", and shifting down the words "wide" and "web" encodes a "0".
Feature Encoding
In feature encoding, features such as shape, size, or position are
manipulated. Certain text features are altered, or left unaltered,
depending on the codeword. For example, one could encode bits into text by
extending or shortening the upward vertical end lines of letters such as
b, d, and h. Generally, feature randomization takes place before encoding:
character end line lengths are randomly lengthened or shortened, then
altered again to encode the specific data. This removes the possibility of
visual decoding, as the original end line lengths would not be known; to
decode, one requires the original image.
Examples of using feature coding:
A long d is decoded as "1"; a short d is decoded as "0".
A long h is decoded as "1"; a short h is decoded as "0".
A long b is decoded as "1"; a short b is decoded as "0".
This method has a number of advantages: it can encode a large amount of
data, and the changes are largely indiscernible to the reader. Its
disadvantage is that feature coding can be defeated by adjusting each end
line length to a fixed value [ALS01].
Color quantization [VIL06]
The main idea of this method is to quantize the color or luminance
intensity of each character in such a manner that the human visual system
is not able to distinguish between the original and quantized characters,
while a specialized reader machine can easily detect the difference. An
example illustrating this method is shown in Figure (2.4). Therein, dark
characters encode a 0, whereas light ones encode a 1. A binary sequence
can be sequentially embedded into the cover text. Notice that the
embedding rate is higher than that of the inter-line or inter-word space
modulation methods.
VAMOS A TRABAJAR
(a)
VAMOS A TRABAJAR
0 1 0 1 1 0 0 1 0 0 0 1 0 1
(b)
Figure (2.4) Color quantization: (a) original text; (b) marked text (exaggerated)
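As a rough sketch of the idea in Figure (2.4), the following Python
fragment assigns one of two gray levels to every non-space character. The
gray values (0 and 40), the threshold, and the function names are
assumptions chosen for illustration; [VIL06] does not prescribe them.

```python
def quantize_colors(text, bits, dark=0, light=40):
    """Assign a gray level (0 = black) to every character.

    A 0 bit keeps the character dark; a 1 bit makes it slightly lighter.
    """
    out, i = [], 0
    for ch in text:
        if ch.isspace() or i >= len(bits):
            out.append((ch, dark))  # spaces and leftover characters carry no data
        else:
            out.append((ch, light if bits[i] == "1" else dark))
            i += 1
    if i < len(bits):
        raise ValueError("cover text is too short for the message")
    return out


def read_bits(marked, n_bits, threshold=20):
    """Recover n_bits by thresholding the gray level of non-space characters."""
    levels = [gray for ch, gray in marked if not ch.isspace()]
    return "".join("1" if gray > threshold else "0" for gray in levels[:n_bits])


message = "01011001000101"  # the 14 bits shown in Figure (2.4)
marked = quantize_colors("VAMOS A TRABAJAR", message)
print(read_bits(marked, len(message)))
```

Each of the 14 letters carries one bit, which is why the embedding rate
exceeds that of methods carrying one bit per line or per word.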
Halftone Quantization [VIL06]
This method relies on halftoning, a widely used printing technology
that enables continuous-tone images to be printed with one color ink
(grayscale) or a few color inks (color). Here, the discussion is
restricted to black-and-white printers.
In order to simulate a given gray shade, a halftone printer uses a
halftone screen. This method exploits the fact that several possible
choices of halftone screen lead to the same gray shade. Therefore, one can
hide data in each text character by using a different halftone screen
according to the message bit m to be embedded. The major strength of this
method is that all characters in the stego text will have the same gray
shade. This method is intended mainly for printed documents.
Figure (2.5) Halftone quantization: (a) original character; (b) marked character for m = 0; (c) marked character for m = 1.
2.7 Steganalysis
A goal of steganography is to avoid drawing suspicion to the
transmission of a hidden message. If suspicion is raised, this goal is
defeated. Steganalysis is the art of discovering such covert messages and
rendering them useless [JOH01].
In other words, steganalysis attempts to detect the existence of hidden
information [ALS01].
The steganalyst is one who applies steganalysis in an attempt to detect
the existence of hidden information and/or render it useless. Two aspects
of steganalysis are the detection and the distortion of embedded messages.
Detection requires that the analyst observe various relationships between
combinations of cover, message, stego-media, and steganography tool.
Distortion attacks require that the analyst manipulate the stego-media to
render the embedded information useless or remove it altogether [ETT98].
2.8 Attacks Available to the Steganalyst
There are many possible situations which confront the steganalyst,
depending on what information is available. The different cases are shown
in Table (2.1) [JAJ98]:
Table (2.1) Steganography Attacks
1- Stego-only attack: only the stego-object is available for analysis.
2- Known cover attack: the "original" cover-object and the stego-object
are both available.
3- Known message attack: at some point, the attacker may know the hidden
message. Analyzing the stego-object for patterns that correspond to the
hidden message may be beneficial for future attacks against that system.
Even with the message, this may be very difficult and may even be
considered equivalent to the stego-only attack.
4- Chosen stego attack: the steganography tool (algorithm) and the
stego-object are known.
5- Chosen message attack: the steganalyst generates a stego-object from
some steganography tool or algorithm using a chosen message. The goal in
this attack is to determine corresponding patterns in the stego-object
that may point to the use of specific steganography tools or algorithms.
6- Known stego attack: the steganography algorithm (tool) is known and
both the original and stego-objects are available.
2.9 Introduction to the Code [ABD01]
A code is nothing more than a set of strings over a certain alphabet. For
example, the set C = {0, 10, 110, 1110} is a code over the alphabet {0, 1}.
Of course, codes are generally used to encode messages. For instance, one
may use the set C to encode the first four letters of the alphabet, as
follows:
a → 0
b → 10
c → 110
d → 1110
One can then encode words (or messages) built up from these letters. The
word "cab", for instance, is encoded as
cab → 110010
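The mapping above can be tried out with a short Python sketch. The names
CODE, encode, and decode are illustrative only. Decoding works bit by bit
because no codeword in C is a prefix of another.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "1110"}


def encode(word):
    """Concatenate the codeword of each letter."""
    return "".join(CODE[ch] for ch in word)


def decode(bits):
    """Read bits left to right and emit a letter whenever a codeword is complete."""
    inverse = {codeword: letter for letter, codeword in CODE.items()}
    letters, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            letters.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(letters)


print(encode("cab"))      # 110010, as in the text
print(decode("110010"))   # cab
```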
2.10 Why Encode the Data [KUO70]
There are three reasons to encode data that is about to be transmitted
(through space, for instance) or stored (on computer disk, for instance).
The first reason is for efficiency. It clearly makes sense to compress data
as much as possible in order to save transmission time or storage space. In
fact, data compression is very big business in the computer world. The
second reason to encode data is for error detection and /or correction.
The third reason is for secrecy, so that unauthorized persons cannot read
the data.
In other words, the goals of encoding are for efficiency, error correction,
and secrecy.
2.11 Huffman Coding
There are different ways of encoding data, and one of these ways is
Huffman coding [Web06].
In 1952, D. A. Huffman published a method for constructing highly
efficient instantaneous encoding schemes. This method is now known as
Huffman encoding [ROM96].
The idea behind Huffman coding is simply to use shorter bit patterns
for more common characters, and longer bit patterns for less common
characters [Web06].
The method starts by building a list of all the alphabet symbols in
descending order of their probabilities. It then constructs a tree with a
symbol at every leaf, from the bottom up. This is done in steps where, at
each step, the two symbols with the smallest probabilities are selected,
added to the top of the partial tree, deleted from the list, and replaced
with an auxiliary symbol representing both of them. When the list is
reduced to just one auxiliary symbol (representing the entire alphabet)
the tree is complete [SAL95].
An Example [Web06]
To encode the letters A (0.12), E (0.42), I (0.09), O (0.30), and U
(0.07), listed with their respective probabilities, go through the
following steps:
1. Consider each of the letters as a symbol with its respective
probability.
2. Find the two symbols with the smallest probabilities and
combine them into a new symbol containing both letters, adding
their probabilities.
(Note 1: There may be a choice between two symbols with the same
probability. If this is the case, either symbol can be chosen; the
final tree and codes will be different, but the overall efficiency
of the code will be the same.)
(Note 2: Frequency counts or other values may be used instead of
probabilities.)
3. Repeat step 2 until there is only one symbol left, with a
probability of 1.
4. To see the code, redraw all the symbols in the form of a tree,
where each symbol contains either a single letter or splits up
into two smaller symbols. Label all the left branches of the
tree with a 0 and all the right branches with a 1. The code for
each of the letters is the sequence of 0's and 1's that lead to it
on the tree, starting from the symbol with a probability of 1.
Figure (2.6) Huffman Tree for example
5. Thus the codes for each letter are:
A = 100, E = 0, I = 1011, O = 11, U = 1010.
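The steps above can be sketched in Python with a priority queue. This is a
minimal illustration, not the implementation used in this work; the
function name huffman_codes is hypothetical. It assumes that, at each
merge, the less probable symbol takes the 0 branch and the more probable
one takes the 1 branch. With that convention the sketch reproduces the
codes listed above. Other labelings give different codes of the same
lengths.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Return a dict mapping each symbol to its Huffman code string."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, low = heapq.heappop(heap)    # smallest probability
        p1, _, high = heapq.heappop(heap)   # second smallest probability
        merged = {sym: "0" + code for sym, code in low.items()}
        merged.update({sym: "1" + code for sym, code in high.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


probs = {"A": 0.12, "E": 0.42, "I": 0.09, "O": 0.30, "U": 0.07}
codes = huffman_codes(probs)
for letter in sorted(codes):
    print(letter, codes[letter])
```

The most probable letter (E) receives a one-bit code, and the two least
probable letters (I and U) receive four-bit codes.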
The Huffman code for the 26-letter Alphabet [ROM96]
Figure (2.7) Huffman tree for the 26-letter Alphabet (the codes and probabilities at the leaves of the tree are those listed in Table (2.2))
Table (2.2) shows the letters of the alphabet with their approximate
probabilities of occurrence in English, based on statistical data. The
last column of the table shows the Huffman code of each letter; the
encoding scheme of Table (2.2) is the one used in this work [ROM96].
Table (2.2) Probabilities of Occurrence in English Text
Symbol   Probability   Huffman code
E        0.1300        000
T        0.0900        0010
A        0.0800        0011
O        0.0800        0100
N        0.0700        0101
R        0.0650        0110
I        0.0650        0111
H        0.0600        10000
S        0.0600        10001
D        0.0400        10010
L        0.0350        10011
C        0.0300        10100
U        0.0300        10101
M        0.0300        10110
F        0.0200        10111
P        0.0200        11000
Y        0.0200        11001
B        0.0150        11010
W        0.0150        11011
G        0.0150        11100
V        0.0100        11101
J        0.0050        111100
K        0.0050        111101
X        0.0050        111110
Q        0.0025        1111110
Z        0.0025        1111111
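A quick way to sanity-check Table (2.2) is to confirm that no code is a
prefix of another (so a bit stream can be decoded unambiguously) and to
compute the average number of bits per letter. The sketch below is only an
illustration. The table values are copied from Table (2.2), while the
helper names are hypothetical.

```python
TABLE_2_2 = {
    "E": (0.1300, "000"),     "T": (0.0900, "0010"),    "A": (0.0800, "0011"),
    "O": (0.0800, "0100"),    "N": (0.0700, "0101"),    "R": (0.0650, "0110"),
    "I": (0.0650, "0111"),    "H": (0.0600, "10000"),   "S": (0.0600, "10001"),
    "D": (0.0400, "10010"),   "L": (0.0350, "10011"),   "C": (0.0300, "10100"),
    "U": (0.0300, "10101"),   "M": (0.0300, "10110"),   "F": (0.0200, "10111"),
    "P": (0.0200, "11000"),   "Y": (0.0200, "11001"),   "B": (0.0150, "11010"),
    "W": (0.0150, "11011"),   "G": (0.0150, "11100"),   "V": (0.0100, "11101"),
    "J": (0.0050, "111100"),  "K": (0.0050, "111101"),  "X": (0.0050, "111110"),
    "Q": (0.0025, "1111110"), "Z": (0.0025, "1111111"),
}


def is_prefix_free(codes):
    """In sorted order, a code that is a prefix of any other is a prefix of its successor."""
    ordered = sorted(codes)
    return all(not nxt.startswith(cur) for cur, nxt in zip(ordered, ordered[1:]))


def average_length(table):
    """Expected number of bits per letter under the table's probabilities."""
    return sum(p * len(code) for p, code in table.values())


codes = [code for _, code in TABLE_2_2.values()]
kraft_sum = sum(2.0 ** -len(code) for code in codes)

print(is_prefix_free(codes))
print(kraft_sum)
print(average_length(TABLE_2_2))
```

A Kraft sum of exactly 1 indicates that the code tree has no unused
leaves.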
Chapter Three
Microsoft Word Document File Format (.doc)
3.1 Introduction
Microsoft Word is word processing software. Many versions of Word were
written for several platforms1, including the IBM PC running DOS, the
Apple Macintosh, and Microsoft Windows, as shown in Figure (3.1).
It is a component of the Microsoft Office system; Microsoft began
calling it Microsoft Office Word instead of merely Microsoft Word.
Figure (3.1) Word Versions for Different Operating Systems (series: MS-DOS, Macintosh, Windows; horizontal axis: years of issuing, 1983 to 2008; vertical axis: Word versions number)
1 Platform: the underlying hardware or software for a system.
3.2 History of Word
Many concepts and ideas of Word were brought from Bravo, the original
GUI word processor developed at Xerox PARC1 [Web08].
Bravo's creator, Charles Simonyi, left PARC to work for Microsoft in
1981. Simonyi hired Richard Brodie, who had worked with him on Bravo,
away from PARC that summer [Web02].
Word featured a concept of "What You See Is What You Get", or
WYSIWYG, and was the first application with such features as the ability
to display bold and italics text on an IBM PC. Word made full use of the
mouse, which was so unusual at the time that Microsoft offered a bundled
Word-with-Mouse package [Web08].
Although MS-DOS was a character-based system, Microsoft Word was the
first word processor for the IBM PC that showed actual line breaks and
typeface markups such as bold and italics directly on the screen while
editing. This was not a true WYSIWYG system, however, because available
displays did not have the resolution to show actual typefaces [Web02].
Word 97
Word 97 had the same general operating performance as later versions
such as Word 2000. This was the first version of Word featuring the
"Office Assistant"2, which was an animated helper used in all Office
programs [Web08].
Word 2000
For most users, one of the most obvious changes introduced with Word
2000 (and the rest of the Office 2000 suite) was a clipboard3 that could
hold multiple objects at once. Another noticeable change was that the
Office Assistant, whose frequent unsolicited appearance in Word 97 had
annoyed many users, was changed to be less intrusive [Web08].
1 Xerox PARC: research and development company, 1970.
2 Office Assistant: animated helper used in all Office programs.
3 Clipboard: a special file or memory area (buffer) where data is stored temporarily before being copied to another location; used for copy and paste.
Word 2002
Word 2002 was bundled with Office XP and was released in 2001. Although
its appearance was different, it had many of the same features as Word
2003. One of the key advertising strategies for the software was the
removal of the Office Assistant in favor of a new help system, although in
Word 2002 it was simply disabled by default [Web08].
Word 2003
For the 2003 version, the Office programs, including Word, were
rebranded to emphasize the unity of the Office suite, so that Microsoft
Word officially became Microsoft Office Word. Users continue to use both
names [Web08].
Word 2007
The release includes numerous changes, including a new XML-
based file format, a redesigned interface, and an integrated equation editor
[Web08].
Word 2008
Word 2008 is the most recent version of Microsoft Word for the
Mac, released on January 15, 2008. It includes some new features from
Documents in Word have a hierarchical structure, as shown in Figure
(3.2).
Figure (3.2) External Structure of a Word Document
Different types of properties apply to different units in the hierarchy:
Section. By default a document is a single section, but settings for
margins, headers and footers, footnotes, and columns apply to whole
sections, so a section break is needed to change any of these for only
part of a document. Make a new section using Insert | Break and
selecting one of the four types of "section breaks".
Paragraph. Most formatting in Word applies at the paragraph level:
indents, line spacing, default font properties, bullets, etc. Many
aspects of paragraph formatting can be applied all at once to a
paragraph using paragraph styles.
Character. Some formatting attributes apply at the level of individual
characters, such as the bold font in the first word of this paragraph.
A set of character attributes can be applied together using character
styles.
In addition to these parts of the main document, there are other special
kinds of text which Word refers to as other "stories". These include
footnotes, comments, headers, and footers. These items are stored
separately from the main text and require special commands to access and
edit.
Customizations. Customizations such as definitions, macros, and toolbars
may be stored either in the document or in the document's associated
template.
Styles. Styles are collections of format specifications which can be
applied all together to a paragraph or a group of characters. The
advantage of using styles to apply formatting is that the formatting of
all paragraphs of a certain type (e.g. examples, section headings, or
footnotes) can easily be changed simply by redefining the style. A
linguistics paper usually goes through a number of stages: as a term
paper, as a draft circulated for comments, as a conference handout, as a
journal submission, and as camera-ready copy for a volume. Each of these
stages has its own format requirements. Using styles right from the
beginning for all formatting can save a huge amount of time over the
life of a paper.
3.4 Annotation and Collaboration Tools [Web11]
A linguist will often be working together with someone else on a
document, either as a co-author or in a student-teacher relationship.
Word has some easy-to-use tools to facilitate such collaborative work.
3.4.1 Track Changes
The “Track Changes” tool gives access to a simple method of keeping
track of the changes a particular user makes to a document. Insertions are
displayed in color and underlined; deletions and format changes are
displayed in bubbles like comments. An example of Track Changes is shown
in Figure (3.3) [Web11].
Track Changes is a way for Microsoft Word to keep track of the changes
you make to a document. Track Changes is also known as redline, or
redlining. This is because some industries traditionally draw a vertical
red line in the margin to show that some text has changed [Web04].
Figure (3.3) Track Changes example
3.4.2 Comments
The “Comment” feature allows comments to be added to the document. In
Page Layout view, recent versions of Word display comments in "bubbles" on
the right side of the text (moving text over to make room in the margin
for the comment). Comments from different reviewers appear in different
colors. An example of comments is shown in Figure (3.4) [Web11].
Figure (3.4) Comments example
3.5 File Format [Web03]
A file format is a particular way to encode information for storage in
a computer file.
Since a disk drive, or indeed any computer storage, can store only bits,
the computer must have some way of converting information to 0s and 1s
and vice-versa. There are different kinds of formats for different kinds
of information. Within any format type, e.g., word processor documents,
there will typically be several different formats. Sometimes these formats
compete with each other.
Some file formats are designed to store very particular sorts of data:
the JPEG format, for example, is designed only to store static
photographic images. Other file formats, however, are designed for storage
of several different types of data.
3.6 Identifying the Type of a File [Web03]
Since files are seen by programs as streams of data, a method is
required to determine the format of a particular file within the file
system (an example of metadata). Different operating systems have
traditionally taken different approaches to this problem, with each
approach having its own advantages and disadvantages, as follows.
3.6.1 Filename Extension
One popular method, in use by several operating systems including DOS
and Windows, is to determine the format of a file based on the section of
its name following the final period. This portion of the filename is known
as the filename extension. For example, HTML documents are identified by
names that end with .html (or .htm) [Web03].
3.6.2 Magic Number
An alternative method, often associated with UNIX and its derivatives,
is to store a "magic number" inside the file itself. Originally, this term
was used for a specific set of 2-byte identifiers at the beginning of a
file, but since any undecoded binary sequence can be regarded as a number,
any feature of a file format which uniquely distinguishes it can be used
for identification. GIF images, for instance, always begin with the ASCII
representation of either GIF87a or GIF89a, depending upon the standard to
which they adhere [Web03].
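The two approaches can be contrasted with a short Python sketch. The
function names and the small MAGIC table are illustrative assumptions. The
GIF signatures come from the text above, and the 8-byte compound document
identifier is the one listed in Table (3.2).

```python
import os

# Leading byte sequences ("magic numbers") and the format they identify.
MAGIC = {
    b"GIF87a": "GIF image (87a)",
    b"GIF89a": "GIF image (89a)",
    bytes.fromhex("D0CF11E0A1B11AE1"): "Microsoft compound document",
}


def type_from_extension(filename):
    """Return the text after the final period, lower-cased, or None."""
    extension = os.path.splitext(filename)[1]
    return extension.lstrip(".").lower() or None


def type_from_magic(data):
    """Return the format whose signature starts the data, or None."""
    for signature, name in MAGIC.items():
        if data.startswith(signature):
            return name
    return None


print(type_from_extension("index.HTML"))
print(type_from_magic(b"GIF89a" + bytes(10)))
```

The extension is only a naming convention and can be changed freely,
whereas the magic number is part of the file contents.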
3.7 File Structure
Each format uses a structure (a way to organize data for storing) in a
file [FOL98].
There are several ways to structure data in a file. The most usual ones
are shown in Figure (3.5).
Figure (3.5) File Structure Types: raw memory dumps (RMD), chunk based format (CBF), and directory based format (DBF)
3.7.1 Raw Memory Dumps/Unstructured Formats (RMD) [Web03]
Earlier file formats used raw data formats that consisted of directly
dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have
reserved spaces for future extensions, extending and improving this type
of file is very difficult. On the other hand, developing tools for reading
and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of
other types of file formats that could be easily extended and be backward
compatible at the same time.
3.7.2 Chunk based Formats (CBF) [Web03]
In this kind of file structure, each piece of data is embedded in a
container that contains a signature identifying the data, as well as the
length of the data (for binary encoded files). This type of container is
called a chunk. The signature is usually called a chunk id, chunk
identifier, or tag identifier.
With this type of file structure, tools that do not know certain chunk
identifiers simply skip the chunks that they do not understand. Even XML
can be considered a kind of chunk based format, since each data element is
surrounded by tags which are akin to chunk identifiers.
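The skip-what-you-do-not-know behaviour can be sketched in Python. The
chunk layout used here (a 4-byte identifier followed by a 4-byte
big-endian length and the payload) and the identifiers TEXT, XTRA, and
AUTH are hypothetical. They are chosen for illustration and do not
describe any particular file format.

```python
import struct


def write_chunk(chunk_id, payload):
    """Build one chunk: 4-byte id, 4-byte big-endian length, then the payload."""
    return struct.pack(">4sI", chunk_id, len(payload)) + payload


def read_chunks(data, known_ids):
    """Return (id, payload) for known chunks; skip unknown ones using their length."""
    position, found = 0, []
    while position + 8 <= len(data):
        chunk_id, length = struct.unpack_from(">4sI", data, position)
        position += 8
        if chunk_id in known_ids:
            found.append((chunk_id, data[position:position + length]))
        position += length  # jump to the next chunk header
    return found


blob = (write_chunk(b"TEXT", b"hello")
        + write_chunk(b"XTRA", bytes(10))   # a chunk this reader does not know
        + write_chunk(b"AUTH", b"me"))
print(read_chunks(blob, {b"TEXT", b"AUTH"}))
```

Because every chunk states its own length, an older reader can step over
chunks added by a newer writer. This is what makes the format extensible.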
3.7.3 Directory based Formats (DBF) [Web03]
This is another extensible format that closely resembles a file system
(OLE documents are actual file systems). The file is composed of
'directory entries' that contain the location of the data within the file
itself as well as its signatures (and in certain cases its type). Good
examples of these types of file structures are disk images and OLE
documents [Web03].
3.8 Structure Storage
The lowest level of organization that is normally imposed on a file is a
stream of bytes.
When data is stored in a file merely as a stream of bytes, the ability to
distinguish among the fundamental information units of the data is lost.
These fundamental pieces of information are called fields. Fields are
grouped together to form records. Records are grouped together to form
blocks [FOL98], as shown in Figure (3.6).
Figure (3.6) Logic view of file (block, record, field, stream of bits 0, 1)
In persistent storage, normally files are stored in the form of bytes. A
file is treated as a raw sequence of bytes. The entire file is stored in
the blocks on the disk. These blocks are scattered on the disk. When
reading this file, the file system manages its pointers and returns a
sequence of bytes [CHA00].
Structure storage follows a different approach to storing a file and its
data on persistent storage. It defines how to treat a file as a structured
collection of objects. These objects are storages and streams, as shown in
Figure (3.7).
Figure (3.7) Storage and Stream Structure (a root storage containing nested storages and streams)
A storage object is a kind of directory; it can contain other storage
objects and stream objects. A stream object can be thought of as a file.
Like a file, a stream contains data stored as a consecutive sequence of
bytes. A compound file is a combination of these two kinds of object
[CHA00].
A compound file is a file which contains different types of data saved in
a structured format. Suppose a compound file contains some text, some
images, and other data, and we want to add one more object to it. In the
traditional approach, when saving a file, the file system rewrites the
entire data. The structured storage approach eliminates this rewriting
process and increases the read/write performance. The new data is written
to the next available location in permanent storage, and the storage
object updates the table of pointers it maintains to track the locations
of its storage objects and stream objects [CHA00].
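The directory-like relationship between storages and streams can be
modelled with a small in-memory tree. This is a sketch of the concept
only: the Storage and Stream classes and their methods are hypothetical,
not part of any Microsoft API, and nothing is written to disk. The
component names are borrowed from Figure (3.8).

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Stream:
    """A stream holds data as a consecutive sequence of bytes, like a file."""
    data: bytes = b""


@dataclass
class Storage:
    """A storage acts like a directory of named storages and streams."""
    children: Dict[str, Union["Storage", Stream]] = field(default_factory=dict)

    def add_storage(self, name):
        node = Storage()
        self.children[name] = node
        return node

    def add_stream(self, name, data):
        self.children[name] = Stream(data)

    def walk(self, prefix=""):
        """Yield the path of every object below this storage, depth first."""
        for name, child in self.children.items():
            path = f"{prefix}/{name}"
            if isinstance(child, Storage):
                yield path + "/"
                yield from child.walk(path)
            else:
                yield path


root = Storage()
root.add_stream("WordDocument", b"main text")
pool = root.add_storage("ObjectPool")
sheet = pool.add_storage("ExcelSheet")
sheet.add_stream("Workbook", b"cells")
print(list(root.walk()))
```

Adding the Workbook stream touches only the ExcelSheet storage. The other
objects are left alone, which mirrors the benefit described above.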
Here are some other benefits:
The structured storage approach provides control over separate
objects. It can read/write separate objects instead of the entire
compound file [CHA00].
More than one user can concurrently read/write the same file
[CHA00].
3.9 Microsoft Compound Document File Format (MCDFF)
A Word file that contains an Excel sheet and chart, an image, a table,
and some macros is an example of a compound file.
Files which use MCDFF (Microsoft Compound Document File Format)
include the output files of MS Office 97-2003 applications such as MS
Word, PowerPoint, and Excel [CHA00].
The Microsoft Compound Document File Format (MCDFF) 2003 is a
document file format based on OLE (Object Linking and Embedding), which
is used for saving various resources as an integrated document in
Microsoft Office [MIC07].
A storage component may exist as a standalone component. Each storage
component may have one or more sub-storage components and stream
components. Also, the root component may have stream components directly
within it [JIT06].
3.10 Structure of Word Document Files
Let's take a look at the structure of a Word document with an embedded
Excel object, shown below in Figure (3.8).
Figure (3.8) Sample of Word document storage format (root "MS Word" with the streams JPEG Image, WordDocument, Data, Table, SummaryInformation, DocumentSummaryInformation, and CompObj, and the storage ObjectPool containing an Excel Sheet storage with the streams Workbook, SummaryInformation, and DocumentSummaryInformation)
The binary format for Microsoft Word 97 and later versions is based on
a structure referred to as a .doc file or compound file.
A Word .doc file consists of [MIC07]:
I. Word Document (main stream)
II. Summary information stream
III. Table stream
IV. Data stream
V. Custom XML storage (added in Word 2007)
VI. Zero or more object streams, which contain private data for OLE 2.0
objects embedded within the Word document [MIC07].
The 'MS Word' component is the root component, containing several
streams and one storage item. Different parts of the document, such as the
actual contents, any table inserted, the CompObj associated with the DLL
files for the objects, the Summary Information for the content, any image
inserted, and the Document Summary Information, all take the form of
streams under the root component. The ObjectPool is the collective storage
of all the sub-storage components. Figure (3.8) displays a sample
sub-storage, the Excel component. The Excel Sheet itself is a storage
component within the ObjectPool and has its own streams of information:
the Workbook, SummaryInformation, and DocumentSummaryInformation
[JIT06].
Custom XML data store (added in Word 2007): The custom XML data store
specifies custom-defined XML files contained in the binary Microsoft
Word 97 format or the Office Open XML formats [MIC07].
Data stream: The stream within a Word .doc file that contains various
data that anchor to characters in the main stream, for example binary
data describing in-line pictures and/or form fields [MIC07].
Main stream: The stream within a Word .doc file that contains the bulk
of Word's binary data [MIC07].
Object storage: A storage that contains binary data for an embedded
OLE 2.0 object. Multiple instances are referred to as storages
[MIC07].
Stream: The physical encoding of a Word document's text and sub data
structures in a random access stream within a .doc file [MIC07].
Summary information stream: The stream within a Word .doc file that
contains the document summary information [MIC07].
Table stream: The stream within a Word .doc file that contains the
various plcf's and tables that describe a document's structures.
The main stream of a Word binary file (complex format) consists of
the Word file header (FIB), the text, and the formatting information.
FIB (File Information Block)
The header of a Word file begins at offset 0 in the file. This gives
the beginning offset and lengths of the document's text stream and
subsidiary data structures within the file. It also stores other file status
information.
The FIB contains a "magic number" and pointers to the various
other parts of the file, as well as information about the length of the file.
The FIB is defined in the structure definition section of this document
[MIC07].
Text
The text part contains all the text of the document (including
footnotes, header and footer lines, etc.). The document's text is also
located in the main stream [DIA08].
Word has used this same file format since its first version. This means
that Word 1.0 can read Word 5.0 files and vice-versa. This compatibility
was accomplished by defining all structures to be larger than they needed
to be and setting all reserved fields to zero for use in future versions.
Reserved pointers in the document header have been used to add entirely
new document sections (such as document retrieval information and
bookmark tables) [Web09].
Because compatibility with future versions is important, all fields in
all structures which are not currently being used MUST be filled with
zeros. When the fields are finally defined for a new feature, zero will
either be made the default value of those fields or will represent an
uninitialized state which is ignored [Web09].
3.12 MCDFF Metadata
MCDFF uses metadata to manage information about streams and storages.
Table (3.1) describes the type of information contained in each metadata
item in MCDFF [HYU08].
Table (3.1) MCDFF Metadata
Name of metadata    Information contained
Header              Signature, pointer table of BAT
BAT                 Block Allocation Table
SBAT                Small Block Allocation Table
Directory           Stream and storage information
The exact format of these metadata structures is documented in the
OpenOffice.org Spreadsheet Project's documentation of the Microsoft
Compound Document File Format [DAN07] and by the Apache POIFS project of
Apache.org [MAR07]. POIFS file systems are called "file systems" because
they contain multiple embedded files in a manner similar to traditional
file systems. A word processor file with the extension ".doc" is actually
a POIFS file system with a document file archived inside it [MAR07].
Most operating systems, including Microsoft Windows, manage hard disk
drives by dividing their storage space into units known as partitions.
Before data can be stored on a partition, the partition must be formatted.
Formatting a partition organizes the associated space into what is called
a filesystem, which provides space for storing the names and attributes of
files as well as the data they contain. Microsoft Windows supports several
types of filesystems, such as FAT and FAT32. Formatting a disk divides the
disk into tracks, and each track is divided into sectors, sometimes called
disk blocks, as shown in Figure (3.9). Partitions comprise the logical
structure of a disk drive, the way humans and most computer programs
understand the structure. However, disk drives have an underlying physical
structure that more closely resembles the actual structure of the
hardware.
Figure (3.9) The structure of a hard disk [MCC99]
MCDFF uses two types of data unit: the small block (sector) and the big
block (block) [HYU08].
If the size of a stream (embedded file) is less than 4096 bytes, it is
stored in small blocks, and the SBAT is used to walk the small blocks
making up the file. If the size is 4096 bytes or larger, it is stored in
big blocks, and the main BAT is used to walk the big blocks making up the
file [MAR07].
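The size rule can be written as a few lines of Python. This is a sketch
under assumptions: the function name allocation_for is hypothetical, and
the block sizes of 512 and 64 bytes are the "most used" values from Table
(3.2), not fixed constants of the format.

```python
import math

MIN_STANDARD_STREAM_SIZE = 4096  # bytes; threshold stated in the text
BIG_BLOCK_SIZE = 512             # assumed: most used sector size
SMALL_BLOCK_SIZE = 64            # assumed: most used short-sector size


def allocation_for(stream_size):
    """Return which table tracks the stream and how many blocks it occupies."""
    if stream_size < MIN_STANDARD_STREAM_SIZE:
        table, block_size = "SBAT", SMALL_BLOCK_SIZE
    else:
        table, block_size = "BAT", BIG_BLOCK_SIZE
    return table, math.ceil(stream_size / block_size)


for size in (100, 4095, 4096, 10000):
    print(size, allocation_for(size))
```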
The (zero-based) index of a sector is called the sector identifier
(SecID). SecIDs are signed 32-bit integer values. If a SecID is not
negative, it must refer to an existing sector. If a SecID is negative, it
has a special meaning [DAN07]:
-1  Free SecID: free sector; may exist in the file, but is not part of
    any stream.
-2  End Of Chain SecID: trailing SecID in a SecID chain.
-3  SAT SecID: sector is used by the sector allocation table.
-4  MSAT SecID: sector is used by the master sector allocation table.
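How a SecID chain is followed can be sketched in Python. The sketch
assumes that the allocation table is available as a list in which entry i
holds the SecID of the sector that follows sector i, as described in
[DAN07]. The function name sector_chain and the sample table are
illustrative only.

```python
FREE_SECID = -1
END_OF_CHAIN_SECID = -2
SAT_SECID = -3
MSAT_SECID = -4


def sector_chain(sat, first_secid):
    """Follow the allocation table from first_secid until End Of Chain."""
    chain, seen = [], set()
    secid = first_secid
    while secid != END_OF_CHAIN_SECID:
        if secid < 0 or secid >= len(sat):
            raise ValueError(f"SecID {secid} does not refer to an existing sector")
        if secid in seen:
            raise ValueError("allocation table contains a loop")
        seen.add(secid)
        chain.append(secid)
        secid = sat[secid]  # the table entry names the next sector
    return chain


# Sector 0 holds the table itself; sectors 3 and 5 are free.
sample_sat = [SAT_SECID, 2, 4, FREE_SECID, END_OF_CHAIN_SECID, FREE_SECID]
print(sector_chain(sample_sat, 1))
```

Here a stream that starts at sector 1 occupies sectors 1, 2, and 4, which
need not be adjacent in the file.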
3.12.1 Compound Document Header
The compound document header (simply “header” in the following)
contains all data needed to start reading a compound document file. The
header is always located at the beginning of the file and is 512 bytes
long; this implies that the first sector (with SecID 0) always starts at
file offset 512. The first 64 bits of the header form the identifier
(magic number) of an Office file.
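The following Python sketch shows how a reader might check the identifier
and locate a sector. It is an illustration, not the code of the proposed
system. The function names are hypothetical, the field offsets are those
listed in Table (3.2), and only little-endian files are handled.

```python
import struct

SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # identifier from Table (3.2)
HEADER_SIZE = 512


def parse_header(header):
    """Extract a few header fields using the offsets given in Table (3.2)."""
    if len(header) < HEADER_SIZE:
        raise ValueError("the header must be 512 bytes long")
    if header[:8] != SIGNATURE:
        raise ValueError("not a compound document file")
    if header[28:30] != b"\xFE\xFF":
        raise ValueError("only little-endian files are handled in this sketch")
    ssz, sssz = struct.unpack_from("<HH", header, 30)
    sat_sectors, directory_secid = struct.unpack_from("<Ii", header, 44)
    (min_stream_size,) = struct.unpack_from("<I", header, 56)
    return {
        "sector_size": 2 ** ssz,
        "short_sector_size": 2 ** sssz,
        "sat_sectors": sat_sectors,
        "directory_secid": directory_secid,
        "min_stream_size": min_stream_size,
    }


def sector_offset(secid, sector_size):
    """File offset of a sector: the header comes first, then the sectors."""
    return HEADER_SIZE + secid * sector_size


# Build a synthetic header with the most used values to exercise the parser.
demo = bytearray(HEADER_SIZE)
demo[0:8] = SIGNATURE
demo[28:30] = b"\xFE\xFF"
struct.pack_into("<HH", demo, 30, 9, 6)
struct.pack_into("<Ii", demo, 44, 1, 1)
struct.pack_into("<I", demo, 56, 4096)
info = parse_header(bytes(demo))
print(info)
print(sector_offset(0, info["sector_size"]))
```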
The header also contains an array of block numbers. These block
numbers refer to blocks in the file. When these blocks are read together
they form the Block Allocation Table. The header also contains a pointer
to the first element in the property table, also known as the root element,
and a pointer to the small Block Allocation Table (SBAT) [MAR07].
The block allocation table or BAT, along with the property table
specifies which blocks in the file system belong to which files [MAR07].
The Contents of the compound document header structure are
described in the following Table.
Table (3.2) Compound document header structure [DAN07].

| Offset | Size | Contents |
|--------|------|----------|
| 0 | 8 | Compound document file identifier: D0 CF 11 E0 A1 B1 1A E1 |
| 8 | 16 | Unique identifier (UID) of this file |
| 24 | 2 | Revision number of the file format (most used is 003E) |
| 26 | 2 | Version number of the file format (most used is 0003) |
| 28 | 2 | Byte order identifier: FEH FFH = Little-Endian, FFH FEH = Big-Endian |
| 30 | 2 | Size of a sector in the compound document file in power-of-two (ssz); real sector size is sec_size = 2^ssz bytes (minimum value is 7, which means 128 bytes; most used value is 9, which means 512 bytes) |
| 32 | 2 | Size of a short-sector in the short-stream container stream in power-of-two (sssz); real short-sector size is short_sec_size = 2^sssz bytes (maximum value is sector size ssz, see above; most used value is 6, which means 64 bytes) |
| 34 | 10 | Not used |
| 44 | 4 | Total number of sectors used for the sector allocation table |
| 48 | 4 | SecID of first sector of the directory stream |
| 52 | 4 | Not used |
| 56 | 4 | Minimum size of a standard stream (in bytes; minimum allowed and most used size is 4096 bytes); streams with an actual size smaller than (and not equal to) this value are stored as short-streams |
| 60 | 4 | SecID of first sector of the short-sector allocation table, or -2 (End Of Chain SecID) if not extant |
| 64 | 4 | Total number of sectors used for the short-sector allocation table |
| 68 | 4 | SecID of first sector of the master sector allocation table, or -2 (End Of Chain SecID) if no additional sectors used |
| 72 | 4 | Total number of sectors used for the master sector allocation table |
| 76 | 436 | First part of the master sector allocation table containing 109 SecIDs |
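As a rough illustration of how the fields of Table (3.2) could be read, the following Python sketch unpacks the header with the standard `struct` module. The names `Header`, `parse_header` and `sector_offset` are hypothetical, only little-endian files are handled, and the offset rule simply follows the statement that sector 0 starts at file offset 512.

```python
import struct
from typing import NamedTuple, Tuple

MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # identifier at offset 0
HEADER_SIZE = 512                          # sector 0 starts at offset 512
# Fixed fields at offsets 0..75; "x" skips the "Not used" areas.
_FIXED = struct.Struct("<8s16sHH2sHH10xIi4xIiIiI")


class Header(NamedTuple):
    sector_size: int
    short_sector_size: int
    sat_sector_count: int
    dir_first_secid: int
    min_stream_size: int
    ssat_first_secid: int
    ssat_sector_count: int
    msat_first_secid: int
    msat_sector_count: int
    msat_head: Tuple[int, ...]  # the 109 SecIDs stored at offset 76


def parse_header(data: bytes) -> Header:
    """Decode the first 512 bytes of a compound document file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header needs at least 512 bytes")
    (magic, _uid, _rev, _ver, byte_order, ssz, sssz, n_sat, dir_start,
     min_size, ssat_start, n_ssat, msat_start, n_msat) = _FIXED.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a compound document file")
    if byte_order != b"\xfe\xff":
        raise ValueError("this sketch only handles little-endian files")
    msat_head = struct.unpack_from("<109i", data, 76)
    return Header(1 << ssz, 1 << sssz, n_sat, dir_start, min_size,
                  ssat_start, n_ssat, msat_start, n_msat, msat_head)


def sector_offset(sec_id: int, sector_size: int) -> int:
    """File offset of a sector, given that sector 0 follows the header."""
    return HEADER_SIZE + sec_id * sector_size
```

With 512-byte sectors, `sector_offset(63, 512)` gives 32768 (00008000H).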
The header format structure in Table (3.3) gives the block information
if the file is stored in blocks.
Note: The shaded cells in Table (3.3) are used in this work.
Note: The previous equation is also used to calculate the block position.
3.12.4 Property Table (Directory)
The Property Table is essentially the directory system. Properties
(directories) are 128-byte records contained within the 512-byte blocks.
Each directory entry refers to a storage or a stream in the compound
document. The zero-based index of a directory entry is called the
directory entry identifier (DirID). There is a special directory entry at
the beginning of the directory (with DirID 0). It represents the root
storage and is called the root storage entry [DAN07]. The contents of the
directory entry structure are described in the following table.
Table (3.4) Directory entry structure [DAN07]

| Offset | Size | Contents |
|--------|------|----------|
| 0 | 64 | Character array of the name of the entry, always 16-bit Unicode characters, with trailing zero character (results in a maximum name length of 31 characters) |
| 64 | 2 | Size of the used area of the character buffer of the name (not character count), including the trailing zero character (e.g. 12 for a name with 5 characters: (5+1)·2 = 12) |
| 66 | 1 | Type of the entry: 00H = Empty, 01H = User storage, 02H = User stream, 03H = LockBytes (unknown), 04H = Property (unknown), 05H = Root storage |
| 67 | 1 | Node colour of the entry: 00H = Red, 01H = Black |
| 68 | 4 | DirID of the left child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), –1 if there is no left child |
| 72 | 4 | DirID of the right child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), –1 if there is no right child |
| 76 | 4 | DirID of the root node entry of the red-black tree of all storage members (if this entry is a storage), –1 otherwise |
| 80 | 16 | Unique identifier, if this is a storage (not of interest in the following, may be all 0) |
| 96 | 4 | User flags (not of interest in the following, may be all 0) |
| 100 | 8 | Time stamp of creation of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes. |
| 108 | 8 | Time stamp of last modification of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes. |
| 116 | 4 | SecID of first sector or short-sector, if this entry refers to a stream; SecID of first sector of the short-stream container stream, if this is the root storage entry; 0 otherwise |
| 120 | 4 | Total stream size in bytes, if this entry refers to a stream; total size of the short-stream container stream, if this is the root storage entry; 0 otherwise |
| 124 | 4 | Not used |

The property format structure in Table (3.5) gives the block information
if the file is stored in blocks.
Note: The shaded cells in Table (3.5) are used in this work.
| Field | Description | Offset | Length | Default value or const |
|-------|-------------|--------|--------|------------------------|
| NAME | A Unicode null-terminated uncompressed 16-bit string (lose the high bytes) containing the name of the property | 0x00, 0x02, 0x04, ... 0x3E | Short[] | 0x0000 for unused elements; field required; 32 (0x40) element max |
| NAME_SIZE | Number of characters in the NAME field | 0x40 | Short | Required |
| PROPERTY_TYPE | Property type (directory, file, or root) | 0x42 | Byte | 1 (directory), 2 (file), or 5 (root entry) |
| NODE_COLOR | Node color | 0x43 | Byte | 0 (red) or 1 (black) |
| PREVIOUS_PROP | Previous property index | 0x44 | Integer | -1 |
| NEXT_PROP | Next property index | 0x48 | Integer | -1 |
| CHILD_PROP | First child property index | 0x4C | Integer | -1 |
| SECONDS_1 | Seconds component of the created timestamp? | 0x64 | Integer | 0 |
| DAYS_1 | Days component of the created timestamp? | 0x68 | Integer | 0 |
| SECONDS_2 | Seconds component of the modified timestamp? | 0x6C | Integer | 0 |
| DAYS_2 | Days component of the modified timestamp? | 0x70 | Integer | 0 |
| START_BLOCK | Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT | 0x74 | Integer | Required |
| SIZE | Actual size of the file this property points to (used to truncate the blocks to the real size) | 0x78 | Integer | 0 |
3.14.5 Block Allocation Table (BAT)
The BAT (Block Allocation Table) is the main table for spaces
within MCDFF, which is needed to read any other stream in the file
[HYU08].
The BAT blocks are pointed at by the BAT array contained in the
header. These blocks form a large table of integers. These integers are
block numbers. The Block Allocation Table holds chains of integers
[MAR07].
The elements in these chains refer to blocks in the files. The
starting block of a file is NOT specified in the BAT; it is specified by
the property of a given file. The elements in this BAT are both the block
number (within the file, minus the header) and the number of the next
BAT element in the chain. This can be thought of as a linked list of
blocks. The BAT array contains the links from one block to the next,
including the end-of-chain marker [MAR07]. The BAT format structure
is shown in Table (3.6).
Here's an example: Let's assume that the BAT begins as follows:
BAT [0] = 2
BAT [1] = 5
BAT [2] = 3
BAT [3] = 4
BAT [4] = 6
BAT [5] = -1
BAT [6] = 7
BAT [7] = -2
Now, if we have a file whose Property Table entry says it begins with
index 0, we walk the BAT array and see that the file consists of blocks 0
(because the start block is 0), 2 (because BAT[0] is 2), 3 (BAT[2] is
3), 4 (BAT[3] is 4), 6 (BAT[4] is 6), and 7 (BAT[6] is 7). It ends at
block 7 because BAT[7] is -2, which is the end-of-chain marker.
Similarly, a file beginning at index 1 consists of blocks 1 and 5;
BAT[5] is -1, which refers to an unused block.
The other special number in a BAT array is -3, which indicates a
"special" block, such as a block used to make up the Small Block Array,
the Property Table, the main BAT, or the SBAT [MAR07].
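The walk through the example BAT can be reproduced with a short Python sketch. The function name `walk_chain` is hypothetical, and stopping at any negative marker (as well as the guard against loops) is a choice of this sketch rather than something prescribed by [MAR07].

```python
from typing import List, Sequence

UNUSED = -1
END_OF_CHAIN = -2
SPECIAL = -3


def walk_chain(bat: Sequence[int], start_block: int) -> List[int]:
    """Follow the BAT from start_block and return the blocks of the file."""
    blocks: List[int] = []
    seen = set()
    index = start_block
    while index >= 0:  # any negative value is a marker, so the walk stops
        if index >= len(bat) or index in seen:
            raise ValueError("corrupt chain at block %d" % index)
        seen.add(index)
        blocks.append(index)
        index = bat[index]
    return blocks


# The example BAT from the text.
example_bat = [2, 5, 3, 4, 6, -1, 7, -2]
file_from_0 = walk_chain(example_bat, 0)  # [0, 2, 3, 4, 6, 7]
file_from_1 = walk_chain(example_bat, 1)  # [1, 5]
```

The start index itself always comes from the property (directory) entry, never from the BAT.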
Table (3.6) Block Allocation Table Block [MAR07]

| Field | Description | Offset | Length | Default value or const |
|-------|-------------|--------|--------|------------------------|
| BAT_ELEMENT | Any given element in the BAT block | 0x0000, 0x0004, 0x0008, ... 0x01FC | Integer | -1 = unused; -2 = end of chain; -3 = special (e.g., BAT block). All other values point to the next element in the chain and the next index of a block composing the file. |
In the physical structure of an MCDFF file, each block following the
header is numbered with an index. Figure (3.10) shows the process of
accessing “Sample A Stream”. The first index number for “Sample A
Stream” is stored in its directory entry. The reader then consults the BAT
to find the index numbers of the other blocks that “Sample A Stream” uses.
In this example, if the first index number in the directory entry is 1,
“Sample A Stream” consists of three blocks, the 1st, 4th and 5th,
according to the BAT [HYU08].
3.15 Word Object Model
Word provides hundreds of objects. These objects are organized in a
hierarchy that closely follows the user interface.
The Word Visual Basic Help contains a diagram of Word's object
model. The figure is "live": clicking on an object takes you to the Help
topic for that object. Figure (3.11) shows the portion of the object model
diagram that describes the Document object [GRA01].
The key object in Word is Document, which represents a single, open
document. The Document object has many properties and methods. Many
of its properties are references to collections such as Paragraphs, Tables
and Sections. Each of these collections contains references to objects of
the indicated type, and each object contains information about the
appropriate piece of the document. For example, the Paragraph object has
properties like KeepWithNext and Style, as well as methods like Indent and
Outdent [GRA01].
Figure (3.11).Word Object Model – The Word Visual Basic Help file offers a global
view of Word's structure [GRA01].
3.16 Platform Invoke (PInvoke)
There is a need to call a function located in an unmanaged DLL
library from within the .NET Framework. Platform invoke, or PInvoke, is
the technique used to make this happen [Web01].
Figure (3.12) a platform invokes call to an unmanaged DLL function [Web01].
When platform invoke calls an unmanaged function, it performs the
following sequence of actions [Web01]:
I. Locates the DLL containing the function.
II. Loads the DLL into memory.
III. Locates the address of the function in memory and pushes its
arguments onto the stack, marshaling data as required.
Note: Locating and loading the DLL, and locating the address of
the function in memory, occur only on the first call to the function.
IV. Transfers control to the unmanaged function.
3.17 Application Programming Interfaces (API) [Web12]
An API is a set of functions that can be used to work with a
component, application, or operating system. Typically, an API consists of
one or more DLLs that provide some specific functionality.
DLLs are files that contain functions that can be called from any
3) 2 bytes containing the size of sectors (small Block) or the size of a Block (big block); the size is 512 bytes here. These are followed by 2 bytes containing the size of short-sectors (the small Block size); the size is 64 bytes here.
5) 4 bytes containing the SecID of the first sector used by the directory, or the Block index of the first block of the property table. It starts at sector (Block) 63 here.
6) 4 bytes containing the minimum size of standard streams. This size is 00001000H = 4096 bytes here (stored little-endian as the bytes 00 10 00 00). As a result, the file is stored in big Blocks and the main BAT is used to walk the big blocks making up the file.
00000040H 01 00 00 00 FE FF FF FF 00 00 00 00 3e 00 00 00
7) 4 bytes containing the Block index of the BAT; it starts at block 62 here.
The second step: finding the starting block of a file, specified by the property (Directory) entry, whose size is 128 bytes:
00008000: 52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00
00008010: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00008020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008040: 16 00 05 01 FF FF FF FF FF FF FF FF 03 00 00 00
00008050: 06 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
00008060: 00 00 00 00 00 00 00 00 00 00 00 00 e0 12 2f 4e
00008070: b8 48 c9 01 42 00 00 00 80 00 00 00 00 00 00 00
1) 64 bytes containing the character array of the entry name (16-bit characters, terminated by the first <00> character). The name of this entry is "Root Entry" here.
2) 4 bytes containing the starting block of the file, used as the first block in the file and as the pointer to the next block from the BAT.
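The values quoted in the first two steps can be checked by decoding the dumped bytes directly. The sketch below uses Python's `struct` module; it assumes that each byte of the dump is read separately (some byte pairs run together in the printed dump) and that all fields are little-endian. The variable names are illustrative only.

```python
import struct

# Header bytes at file offset 00000040H (header offsets 64 to 79).
header_row = bytes.fromhex("01 00 00 00 FE FF FF FF 00 00 00 00 3E 00 00 00")
ssat_count, msat_start, msat_count, first_bat_block = struct.unpack("<IiIi", header_row)
# first_bat_block is the first SecID of the allocation table array: 62

# The directory starts at block 63; with 512-byte blocks after a 512-byte header:
directory_offset = 512 + 63 * 512  # 32768 = 00008000H

# First 20 bytes of the entry: the name without its trailing zero character.
name = bytes.fromhex(
    "52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00 72 00 79 00"
).decode("utf-16-le")

# Row 00008070 of the entry: entry offsets 112 to 127.
last_row = bytes.fromhex("B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00")
start_block, size = struct.unpack_from("<iI", last_row, 4)  # entry offsets 116 and 120
```

Under these assumptions the root entry's starting block decodes to 42H = 66 and its size field to 80H = 128 bytes, while the header row confirms that the BAT starts at block 62.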
The third step: loading the BAT array to access the unused Block and hide the secret message in it. The Block Allocation Table for this cover is:
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 … …
Value: 1 2 3 4 5 6 7 8 9 10 11 12 … -1 …
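Searching the BAT for unused blocks amounts to scanning for the value -1. The Python sketch below shows the idea on a small made-up BAT, not the BAT of the cover document; `find_unused_blocks` and `block_offset` are hypothetical names, and the 512-byte header and block size are the values used in this example.

```python
from typing import Iterable, List

UNUSED = -1  # BAT marker for a block that belongs to no stream


def find_unused_blocks(bat: Iterable[int]) -> List[int]:
    """Return the indices of all BAT elements marked as unused."""
    return [index for index, value in enumerate(bat) if value == UNUSED]


def block_offset(block_index: int, block_size: int = 512, header_size: int = 512) -> int:
    """File offset of a block, assuming block 0 directly follows the header."""
    return header_size + block_index * block_size


# Hypothetical BAT: blocks 4 and 6 are unused.
toy_bat = [1, 2, 3, -2, -1, -3, -1]
candidates = find_unused_blocks(toy_bat)      # [4, 6]
first_offset = block_offset(candidates[0])    # 512 + 4 * 512 = 2560
```

A hiding routine would then write the encoded message at such an offset, limited to the block size.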
The secret message will be:
THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT
TEN PM FROM CELING OUR TARGET DIAMOND
The fourth step: Encoding the secret message with Huffman Coding:
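As a rough sketch of this step, the Python code below builds a Huffman code from the character frequencies of the message itself and encodes the message as a bit string. This is an assumption made for illustration: the thesis may use a fixed tree (such as the 26-letter alphabet tree of Figure 2.7), so the code words produced here are not necessarily those of the original system. All function names are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a prefix code from the symbol frequencies of text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (weight, tie-breaker, {symbol: code so far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text: str, codes: Dict[str, str]) -> str:
    """Concatenate the code words of all characters."""
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: Dict[str, str]) -> str:
    """Invert encode() by matching code words from left to right."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


message = ("THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT "
           "TEN PM FROM CELING OUR TARGET DIAMOND")
codes = huffman_codes(message)
bits = encode(message, codes)
```

Because the code is prefix-free, `decode(bits, codes)` recovers the message exactly, and the bit string is shorter than the 8 bits per character of the plain text.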
popular Text hiding Methods
This work differs from other text hiding methods in the following ways:

Table (5.1) Comparison between the proposed system and other text hiding methods

| No. | The proposed system | Text hiding systems |
|-----|---------------------|---------------------|
| 1 | No difference is found in the apparent text between the cover document and the stegodocument. | A difference in the apparent text between the cover and the stegodocument is found in hiding methods such as inter-line and inter-word spacing. |
| 2 | The hidden data is not related to the text cover; it can be English or Arabic text. | Hidden data may be related to the text cover. |
| 3 | No problem was detected in the hidden data when mailing or copying the stegodocument. | Some programs like "sendmail" may inadvertently remove the extra space characters that carry the hidden data. |
| 4 | Must access the binary file format, which describes exactly how the data is encoded and how to reach the unused Block in order to hide data. | Does not need to know the binary file format. |
| 5 | Using the Track Changes tool does not affect the hidden data. | This tool has not yet been used in related work. |
| 6 | Cannot be detected by software that detects any change in character features. | Can be detected by such software. |
| 7 | In this work, it was found that: cover size = 34 KB, hidden size = 63 bytes; the size of an empty document is 10 to 11 KB. | Taking the Open Space method, inter-sentence spacing requires a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text. |
Appendix C
Structure of File Information Block (FIB) [MIC07]
In Word version 8, the FIB is reorganized to make future extension easier, and to make it easier to make backward compatible file format changes. The FIB now consists of four substructures: the header and three arrays. The FIB header is unchanged from past versions. The second part is an array of 16-bit "shorts", most of which were present in earlier versions in different locations. The third part is an array of 32-bit longs, many of which were scattered through the previous version FIB. Finally, there is an array of FC/LCB pairs, which were divided into several disjoint arrays in the previous FIB.
Future versions of Word will add entries to the three arrays, so readers of the FIB must be careful to skip over any entries in each array that were not present in the version for which the reader was designed. Writers of the FIB must write exactly as many entries as were defined for the nFib value they put in the FIB.

The FIBFCLCB structure, used in an array in the FIB:

| Decimal | Hex | Name | Type | Bitfield size | Comments | Introduced |
|---------|-----|------|------|---------------|----------|------------|
| 0 | 0x0000 | fc | long | | File position where data begins. | |
| 4 | 0x0004 | lcb | ulong | | Size of data. Ignore fc if lcb is zero. | |

The FCPGDOLD structure, referenced in the FIB, used internally by Word:

| Decimal | Hex | Name | Type | Bitfield size | Comments | Introduced |
|---------|-----|------|------|---------------|----------|------------|
| 0 | 0x0000 | fcPgd | long | | File position where data begins. | |
| 4 | 0x0004 | lcbPgd | ulong | | Size of data. Ignore fc if lcb is zero. | |
| 8 | 0x0008 | fcBkd | long | | File position where data begins. | |
| 12 | 0x000C | lcbBkd | ulong | | Size of data. Ignore fc if lcb is zero. | |

The FCPGD structure, referenced in the FIB, used internally by Word. This modified version of the above structure was introduced in Word 2003:

| Decimal | Hex | Name | Type | Bitfield size | Comments | Introduced |
|---------|-----|------|------|---------------|----------|------------|
| 0 | 0x0000 | fcPgd | long | | File position where data begins. | Word 2003 |
| 4 | 0x0004 | lcbPgd | ulong | | Size of data. Ignore fc if lcb is zero. | Word 2003 |
| 8 | 0x0008 | fcBkd | long | | File position where data begins. | Word 2003 |
| 12 | 0x000C | lcbBkd | ulong | | Size of data. Ignore fc if lcb is zero. | Word 2003 |
| 16 | 0x0010 | fcAfd | Fc | | File position where data begins. | Word 2003 |
Security is a requirement for the individual, society and the world.
Security is not the responsibility or privilege of guards or security
agents only. Security is the concern of every person, since keeping the
door closed is the responsibility of every person who passes through that
door, regardless of height, colour, class or status in life.
Data hiding research has become the heart of data security research,
because all web sites and network communications depend on video, audio,
images, and so on.
Data hiding techniques can conceal secret information in a digital medium
without degrading the perceptual quality of that medium, so that other
people cannot perceive the existence of secret information in that medium.
This thesis proposes a steganographic method that exploits the physical
properties of the computer system and the way it stores a (.doc) file and
handles it as a compound file, so that the unused (empty) block is used to
hide data in the compound structure of the Microsoft Word file. It also
makes use of the capabilities that Microsoft Word provides, such as its
tools, to generate the cover.
The proposed system hides one text in another text by steganography using
two processes: the cover generation process and the embedding process.
Cover generation process: since the cover is a Microsoft Word 2003
document, it will in turn appear as the product of collaborative writing
efforts among several authors.
Embedding process: hides a text string in the unused (empty) block of the
binary structure of the file.
This thesis presents a hiding system for Word, which is one of the
Microsoft Office applications. After reviewing the other applications, we
found that it has fewer weaknesses than the other applications, based on
the latest published research.
This system was implemented using C# 2003 on the Windows XP operating system on a computer