See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/311969898
Steganographic Method for Data Hiding in Microsoft Word Documents
structure by a Change Tracking Technique
Thesis · May 2009
Abdul Monem S. Rahma
University of Technology, Iraq
All content following this page was uploaded by Abdul Monem S. Rahma on 30 December 2016.
Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Technology
College of Science
Department of Computer Science

Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique

A Thesis Submitted to the Department of Computer Science of the University of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

By
Amani Yousif Al-Baghdady

Supervision By
Prof. Dr. Abdul Monem S. Rahma

May 26, 2009
Jamada El Thaniah 2, 1430

In the name of God, the Most Gracious, the Most Merciful
[Qur'anic verse: Surat An-Nur (The Light), verse 35]
God, the Most High, the Almighty, has spoken the truth.
Acknowledgment
First of all, my great thanks go to God, who helped me and gave me the ability to perform this work.
My deepest gratitude and appreciation go to my supervisor, Prof. Dr. Abdul Monem S. Rahma, for his helpful comments, his bright ideas, the technical information he provided, and his generosity with his knowledge; he taught me to overcome the impossible to reach my aim.
The guidance, advice, suggestions, kind heart, encouragement and fruitful assistance of my co-supervisor, Dr. Hala B. Abdul Wahab, were of great help in finishing this thesis.
I would like to express my gratitude to Dr. Hilal H. Saleh, Head of the Computer Science Department of the University of Technology, for offering his encouragement.
I would like to say "thank you" to Dr. Emad K. Jabar for his parental guidance combined with sweet objective hardness.
Special thanks and appreciation go to Mr. Faiq S. Baji for his advice and support during the period of my study. Furthermore, this work would not have been achieved without the support and friendship of Esrra J. Baker and Huda Abdul Ridah AL-Safar.
I would like to thank all the staff members of the Computer Science Department, especially Ms. Suham Abd in the department library.
Finally, I would like to thank my family for giving me so much time to improve myself and for helping me to think only of the best.
Dedication
In the name of God, Most Gracious, Most Merciful
To the City of Science and its Teacher ... Prophet "Mohamed"
To my injured Country ... Iraq
To the guardian angel, the pure affection, the school of our age and stream of kindness who provides me with love, strength and courage, the person to whom I am still indebted, the dearest person ... my Mother
To the great man who teaches me patience, and inspires me to seek the truth and all the wonderful things I know ... my Father
To those who taught me to depend on myself to be like them, the guidance without which my steps are aimless in the darkness, the bright candles ... my Brothers:
(Dr. Mahmood, Dr. Ali, LT. Pilot Anwar & Stu. Ibraheem)
To the one who ignites my enthusiasm whenever its torch fades ... my Uncle
(Assis. Prof. Sulaiman M. Abbas, Head of Electrical Eng. Dep.)
To the true companions, who proved the deep meaning of friendship, who enriched me with courage and love ... my Friends:
(Huda, Dalya, Afrah, Azhar, Nuha, Issra, Zainab, Rabab, Sara, Roa'a)
To the soul of my Aunt Suham
To everyone who helped me even with a word
I hope that I will be well thought of
The researcher
Miss Hacker
Linguistic Certification
This is to certify that this thesis entitled "Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique" by "Amani Y. Noori" was prepared under my linguistic supervision. Its language was amended to meet the style of the English language.
Linguistic Supervisor
Signature:
Name: K. M. Ahmed Al-Najjar
Date: / / 2009
Supervisor Certificate
I certify that this thesis was prepared under my supervision at the Department of Computer Science in the University of Technology in partial fulfillment of the requirements for the Master's Degree in Computer Science.
Signature:
Name: Prof. Dr. Abdul Monem S. Rahma
Date: / / 2009
Examining Committee Certificate
This is to certify that we have read this thesis entitled "Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique" and, as an examining committee, examined the student "Amani Yousif Noori" in its contents and in what is related to it, and that in our opinion it meets the standard of a thesis for the Degree of Master in Computer Science at the Computer Science Department, University of Technology, with an excellent grade.
Signature: Signature:
Name: Dr. Saad K. Majeed Name: Dr. Murtadha M. Hamad
(Chairman) (Member)
Date: / / 2009 Date: / / 2009
Signature: Signature:
Name: Dr. Rehab F. Hassan Name: Dr. Abdul Monem S. Rahma
(Member) (Supervisor)
Date: / / 2009 Date: / / 2009
Approved by, the Computer Science Department, University of Technology
Signature:
Name: Dr. Helal H. Saleh
Date: / / 2009
Head of Computer Science Department
Abstract
Security is a need of the individual, of society and of the world, and it is not a responsibility or privilege accorded only to guards or security agents. Information hiding has become a focus of information security research because websites and network communication depend on multimedia such as audio, video and images. Information hiding technology can embed secret information into a digital media source without impairing the perceptual quality of that source, so other people cannot perceive the secret information.
In this thesis, a data hiding method is proposed that takes advantage of the physical characteristics of the computer system and of how it stores a document file, treating the file as a compound file. The unused block in the Microsoft Compound Document File Format (MCDFF) is used to hide data. Facilities provided by the Microsoft Word word processor, such as its Tools, are also used to generate the cover for hiding.
The proposed system embeds the steganographic text in the structure (binary file format) of a digital and printed text document, namely a Microsoft Word document file (.doc), using two hiding processes: the Cover Generation Process and the Embedding Process.
Cover Generation Process: the cover is a document in the Microsoft Word 2003 document file format (.doc) that appears to be the product of a collaborative writing effort between authors.
Embedding Process: a text string is hidden in an unused block of the binary file format of that cover document.
This thesis introduces a system for hiding in Microsoft Word, which is a component of the Microsoft Office System. Among the Microsoft Office applications, Microsoft Word was found to be less vulnerable than the others, according to the latest published research.
The system is implemented in Visual C# .NET 2003 under Windows XP Service Pack 2, on a P4 laptop computer with 1 GB of RAM and a 2.00 GHz Mobile Intel processor.
List of Abbreviations
Acronym Full Name
ASCII American Standard Code for Information Interchange
API Application Programming Interface
APIs Office Application Programming Interface
BAT Block Allocation Table
BPCS Bit Plane Complexity Segmentation
CBF Chunk Based Format
CFG Context Free Grammar
CLR Common Language Runtime
COM Component Object Model
DBF Directory Based Format
DCT Discrete Cosine Transformation
DirID Directory Identifier
DLL Dynamic-Link Library
FIB File Information Block
GIF Graphic Interchange Format
GUI Graphical User Interface
HAS Human Auditory System
HTML Hyper Text Markup Language
IEEE Institute of Electrical and Electronics Engineers
IH Information Hiding
JPEG Joint Photographic Experts Group
LSB Least Significant Bit
Mac Macintosh
MCDFF Microsoft Compound Document File Format
MSAT Master Sector Allocation Table
MSDN Microsoft Developer Network
MSDOS Microsoft Disk Operating System
OLE Object Linking and Embedding
PIA Primary Interop Assembly
PInvoke Platform Invoke
POIFS Poor Obfuscation Implementation File System
RMD Raw Memory Dumps
RTF Rich Text Format
SAT Sector Allocation Table
SBAT Small Block Allocation Table
SecID Sector Identifier
TCP/IP Transmission Control Protocol / Internet Protocol
UTF Unicode Transformation Format
VBA Visual Basic for Application
Win Windows
WYSIWYG What You See Is What You Get
XML Extensible Markup Language
List of Figures
Figure No. Description
1.1 Information Hiding Hierarchy
1.2 Generic digital watermarking scheme
1.3 Watermarking example
1.4 A data hiding example
2.1 Steganography basic model
2.2 Steganography Types
2.3 Text Hiding methods
2.4 Color quantization
2.5 Halftone quantization
2.6 Huffman Tree for example
2.7 Huffman tree for the 26-letter Alphabet
3.1 Word Versions for Different Operating System
3.2 External Structure of a Word Document
3.3 Track Change Example
3.4 Comments Example
3.5 File Structure Types
3.6 Logic view of file
3.7 Storage and Streams structure
3.8 Sample Word document storage format
3.9 The structure of Hard Disk
3.10 MS Compound files structure
3.11 Word Object Model
3.12 Platform Invoke call to an unmanaged DLL Function
4.1 Block Diagram for Proposed System
4.2 Screenshot of Microsoft Word in case of collaborative document authoring
4.3 Author A sends a stegodocument S to a recipient B
4.4 Hiding Algorithm Flowchart
4.5 Search Unused Block Algorithm Flowchart
4.6 Extracting Algorithm Flowchart
5.1 Word Reference
5.2 Block diagram for Unused Block path in Document file
5.3 The main menu for the proposed system
5.4 Cover Document before Track Change
5.5 Cover Document after Track Change
5.6 The Embedding Process Window
5.7 Document after Hiding
5.8 Extracting Process Window
List of Tables
Table No. Description
2.1 Steganography Attacks
2.2 Probabilities of occurrence in English language
3.1 MCDFF Metadata
3.2 Compound document header structure
3.3 Header (block 1), 512 (0x200) bytes
3.4 Directory entry structure
3.5 Property, 128 (0x80) byte block
3.6 Block Allocation Table
3.7 Office 2003 applications and component type libraries
5.1 Comparisons between the proposed system and other text hiding methods
Glossary
Term: Description
Byte order: The order in which single bytes of a bigger data type are represented or stored.
Compound document: File format used to store several objects in a single file; objects can be organized hierarchically in storages and streams.
Compound document header: Structure in a compound document containing initial settings.
Control stream: Stream in a compound document containing internal control data.
Directory: List of directory entries for all storages and streams in a compound document.
Directory entry: Part of the directory containing relevant data for a storage or a stream.
Directory entry identifier (DirID): Zero-based index of a directory entry.
Directory stream: Sector chain containing the directory.
DirID: Zero-based index of a directory entry (short for "directory entry identifier").
End Of Chain SecID: Special sector identifier used to indicate the end of a SecID chain.
File offset: Physical position in a file.
Free SecID: Special sector identifier for unused sectors.
Header: Short for "compound document header".
Master sector allocation table (MSAT): SecID chain containing sector identifiers of all sectors used by the sector allocation table.
MSAT SecID: Special sector identifier used to indicate that a sector is part of the master sector allocation table.
Red-black tree: Tree structure used to organise direct members of a storage.
Root storage: Built-in storage that contains all other objects (storages and streams) in a compound document.
Root storage entry: Directory entry representing the root storage.
SecID: Zero-based index of a sector (short for "sector identifier").
SecID chain: An array of sector identifiers (SecIDs) specifying the sectors that are part of a sector chain; it thus enumerates all sectors used by a stream.
Sector: Part of a compound document with fixed size that contains any kind of stream (user stream or control stream) data.
No. Subject
Chapter One: General Introduction and Survey
1.1 Introduction
1.2 Information Hiding History
1.3 Information Hiding Hierarchy
1.4 The Difference between Cryptography, Steganography and Watermarking
1.5 Information Hiding Applications
1.6 Literature Survey
1.7 Aim of Thesis
1.8 Thesis Outlines
Chapter Two: Steganography
2.1 Introduction
2.2 Steganography Basic Model
2.3 Steganography Types
2.3.1 Pure Steganography
2.3.2 Secret Key Steganography
2.3.3 Public Key Steganography
2.4 Steganography Algorithms
2.4.1 Spatial Domain Based Steganography
2.4.2 Transform Domain Based Steganography
2.4.3 Document Based Steganography
2.4.4 File Structure Based Steganography
2.5 Steganography Under Various Media
2.5.1 Hiding in Disk Space
2.5.2 Hiding in Network Packets
2.5.3 Hiding in Software and Circuitry
2.5.4 Hiding in Video
2.5.5 Hiding in Audio
2.5.6 Hiding in Image
2.5.7 Hiding in Text
2.6 Classification of Text Hiding Techniques
2.7 Steganalysis
2.8 Attacks Available to the Steganalyst
2.9 Introduction to the Code
2.10 Why Encode the Data
2.11 Huffman Coding
Chapter Three: Microsoft Word Document File
3.1 Introduction
3.2 History of Word
3.3 Microsoft Word Document and its Components
3.4 Annotation and Collaboration Tools
3.4.1 Track Changes
3.4.2 Comments
3.5 File Format
3.6 Identify the Type of a File
3.6.1 Filename Extension
3.6.2 Magic Number
3.7 File Structure
3.7.1 Raw Memory Dumps/Unstructured Formats (RMD)
3.7.2 Chunk Based Formats (CBF)
3.7.3 Directory Based Formats (DBF)
3.8 Structure Storage
3.9 Microsoft Compound Document File Format (MCDFF)
3.10 Structure of Word Document Files
3.11 Format of the Main Stream
3.12 MCDFF Metadata
3.12.1 Compound Document Header
3.12.2 Byte Order
3.12.3 Sector File Offset
3.12.4 Property Table (Directory)
3.12.5 Block Allocation Table (BAT)
3.12.6 Sector Allocation Table (SAT)
3.13 Office Automation
3.14 PIA for Microsoft Office 2003
3.15 Word Object Model
3.16 Platform Invoke (PInvoke)
3.17 Application Programming Interface (API)
3.18 Office Application Programming Interface (APIs)
Chapter Four: Proposed Hiding System in Document File
4.1 Introduction
4.2 Cover Generation Process
4.3 Embedding Process
Chapter Five: Experimental Results and Discussion
5.1 Introduction
5.2 System Implementation
5.2.1 Document before Hiding
5.2.2 Embedding Process
5.2.3 Document after Hiding
5.2.4 Extracting Process
5.3 Comparisons between Proposed System and the Most Popular Hiding Methods
Chapter Six: Conclusions and Suggestions for Future Work
6.1 Conclusions
6.2 Suggestions for Future Work
Glossary
References
Appendix A
Appendix B
Appendix C
Chapter One
General Introduction and Survey
1.1 Introduction [XIU06]
With the development of the Internet and information processing technologies and the rapid development of communication, images, audio, video and other multimedia information can be rapidly transmitted over a variety of communication networks, which provides greater convenience for compression, storage, and reproduction processing applications. At the same time, it is convenient to share information resources, and the network has become the main means of communication. Now all confidential information, including national security information, military information, and personal information (such as credit card numbers), needs to be transmitted through the network, but the Internet is an open environment, so information security has become increasingly important.
Information security technology has two main branches: cryptography and information hiding. Cryptography has been widely used in various industries; there have been many years of research in encryption technology and there are many encryption algorithms. However, encryption clearly signals to observers that a document or other medium has been encrypted, so an attacker can use a variety of tools to attack the secret information. Although encryption techniques have developed rapidly, the attacker's tools have also been strengthened; as the saying goes, "intruders always keep one step ahead". Because of the rapid growth of computing capability, some limitations have already appeared in the application of encryption technology. This has drawn more attention to the other main branch of information security, information hiding.
The purpose of traditional encryption technology is to conceal the content, so that encrypted documents are difficult to read.
1.2 Information Hiding History
Hiding messages is nothing new; over the years, a multitude of methods have been used to hide information. One of the first documents describing steganography is the Histories of Herodotus. In ancient Greece, text was written on wax-covered tablets. To avoid capture of a message, the sender scraped the wax off the tablets and wrote the message on the underlying wood, then covered the tablets with wax again. The tablets appeared to be blank and unused, so they passed inspection by sentries without question [JOH99].
Historically, various steganographic techniques have been used, including:
I. Tattoo. A Roman general shaved the head of a slave and tattooed a message on his scalp. When the slave's hair grew back, the general dispatched the slave to deliver the hidden message to its intended recipient [DIC07].
II. Character marking. Selected letters of printed or typewritten text are overwritten in pencil. The marks are ordinarily not visible unless the paper is held at an angle to bright light [DOB97].
III. Invisible ink. From the 1st century through World War II, invisible inks were often used to conceal hidden messages. A number of substances (milk, vinegar, fruit juices and urine) can be used for writing. They leave no visible trace until heat or some chemical is applied to the paper.
IV. Pin punctures. Small pin punctures on selected letters are ordinarily not visible unless the paper is held up in front of a light [DOB97].
V. Microfilm. While Paris was under siege in 1870, messages were sent by carrier pigeon. A Parisian photographer used a microfilm technique to enable each pigeon to carry a higher volume of data [DIC07].
VI. Null ciphers (unencrypted messages) were also used. In this method a chosen letter of each word, for example the first, spells out a message. Such messages are very hard to construct [KAH96].
The following message was actually sent by a German spy during the Second World War [RIM97]:
"Apparently neutral's protest is thoroughly discounted and ignored. Isman hard hit. Blockade issue affects pretext for embargo on by-products, ejecting suets and vegetable oils".
Decoding this message by taking the second letter in each word reveals the following secret message:
"Pershing sails from NY June 1".
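The second-letter decoding described above can be sketched in a few lines of Python. This is only an illustration and not part of the thesis: the function name `decode_null_cipher` and its `position` parameter are assumptions, and the cover text is the commonly quoted wording of the message, which includes the word "protest" after "neutral's".

```python
def decode_null_cipher(text, position=1):
    """Collect the letter at index `position` (0-based) of every word."""
    letters = []
    for word in text.split():
        # Drop punctuation attached to the start or end of the word.
        word = word.strip(".,;:!?\"'")
        if len(word) > position:
            letters.append(word[position].lower())
    return "".join(letters)


COVER = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo "
    "on by-products, ejecting suets and vegetable oils."
)

# Second letter of each word; the final "i" stands for the numeral 1.
print(decode_null_cipher(COVER))  # pershingsailsfromnyjunei
```

The decoder is trivial; the hard part, as the text notes, is writing a natural-looking cover sentence whose words carry the right letters.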
1.3 Information Hiding Hierarchy
Information Hiding (IH) is a technique in the area of information security. It secretly embeds information into digital content such as images, audio, movies and documents so that it cannot be visually or audibly perceived; a data hiding example is shown in Figure (1.4) [YOS06].
The terminology agreed at the first international workshop on this subject is shown in Figure (1.1) [CAC98]:
Covert channels, in the context of multilevel secure systems (e.g. military computer systems), are communication paths that were neither designed nor intended to transfer information at all. These channels are typically used by untrustworthy programs to leak information to their owner while performing a service for another program [KAT00].
Anonymity is finding ways to hide the metacontent of messages, that is, the sender and the recipients of a message [KAT00].
[Figure (1.1) Information hiding hierarchy: IH branches into covert channels, steganography, anonymity and copyright marking; copyright marking branches into robust copyright marking (fingerprinting and watermarking) and fragile watermarking.]
Steganography, an important subdiscipline of information hiding, is the art and science of communicating in a way which hides the existence of the communication [KAT00].
Fingerprinting is a term that denotes special applications of watermarking. It relates to watermarking applications in which information such as the creator or recipient of digital data is embedded as watermarks [KAT00].
In contrast to steganography, copyright marking guarantees that embedded data can be reliably detected after the image has been modified (but not destroyed beyond recognition) [CAC98].
Watermarking is the process of embedding information into digital multimedia content such that the information (which we call the watermark) can later be extracted or detected for a variety of purposes, including copy prevention and control; an example of watermarking is shown in Figure (1.3) [BAK05].
[Figure (1.2) Generic digital watermarking scheme [KAT00]; labels: watermark, host data, secret/public key (K), marking algorithm, watermarked data.]
There are several approaches to classifying watermarking systems. One could categorize them according to the robustness of the watermark against types of attack.
Fragile watermarks are watermarks that have only very limited robustness. The embedded watermark will change, or disappear, if the watermarked object is altered. This type of watermark can be used for authentication purposes, to verify the originality of the watermarked object [BAK05].
Robust watermarking is designed to survive "moderate to severe signal processing attacks", in such a way that any signal transform of reasonable strength cannot remove the watermark. Robust watermarks are applicable to image copyright protection and fingerprinting [BAK05].
Figure (1.3) Watermarking example [ROC08]
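The behaviour of a fragile watermark, which breaks as soon as the marked object is altered, can be illustrated with a small Python sketch. It shows one common construction and is not a method taken from the thesis or from [BAK05]: a SHA-256 digest of the data is written into the least significant bits of the first 256 samples. The names `embed_fragile`, `verify_fragile` and `N_BITS` are hypothetical.

```python
import hashlib

N_BITS = 256  # a SHA-256 digest has 256 bits, one per carrier sample


def _digest_bits(samples):
    # Clear the LSBs of the carrier samples before hashing, so that
    # writing the watermark into those LSBs does not change the digest.
    head = bytes(s & 0xFE for s in samples[:N_BITS])
    digest = hashlib.sha256(head + bytes(samples[N_BITS:])).digest()
    return [(byte >> i) & 1 for byte in digest for i in range(8)]


def embed_fragile(samples):
    """Return a copy of `samples` with the digest stored in the LSBs."""
    if len(samples) < N_BITS:
        raise ValueError("need at least 256 samples")
    marked = bytearray(samples)
    for i, bit in enumerate(_digest_bits(samples)):
        marked[i] = (marked[i] & 0xFE) | bit
    return bytes(marked)


def verify_fragile(samples):
    """True only if the stored LSBs still match the recomputed digest."""
    if len(samples) < N_BITS:
        return False
    bits = _digest_bits(samples)
    return all((samples[i] & 1) == bit for i, bit in enumerate(bits))


cover = bytes(range(256)) * 2          # stand-in for pixel or audio samples
marked = embed_fragile(cover)
tampered = bytearray(marked)
tampered[300] ^= 0x10                  # alter a single sample
print(verify_fragile(marked), verify_fragile(bytes(tampered)))
```

A robust watermark would be designed the opposite way, spreading redundant information so that such a small change leaves it detectable.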
1.4 The Differences between Cryptography, Steganography and Watermarking
The cryptographer's interest is primarily in obscuring the content of a message, but not the communication of the message. The steganographer, on the other hand, is concerned with hiding the very communication of the message, while the digital watermarker attempts to add sufficient metadata to a message to establish ownership, provenance, source, etc. Cryptography and steganography share the feature that the object of interest is embedded, hidden or obscured, whereas the object of interest in watermarking is the host or carrier, which is being protected by the object that is embedded, hidden or obscured. Further, watermarking and steganography may be used with or without cryptography; and imperceptible watermarking shares functionality with steganography, whereas perceptible watermarking does not [BER06].
1.5 Information Hiding Applications [XIU06]
Information hiding technology has been applied in many fields, including e-commerce, electronic transaction protection, confidential communications, copyright protection, copy control, operation tracking, authentication, and signatures.
Recent research shows that the following applications of information hiding have stimulated research interest:
I. Military organizations and other intelligence agencies need secret communication. On the modern battlefield, where detection of a sensitive signal may lead to a rapid attack, the military often uses techniques such as spread-spectrum and atmospheric scatter transmission to ensure reliable signal transmission.
II. Terrorists are also studying the use of information hiding technology. Through research, US anti-terrorist organizations concluded that in the September 11 incident the terrorists used steganography techniques, embedding instructions into multimedia (such as images) transmitted over the Internet. Without specialized steganalysis tools, it is difficult to detect pictures that have been processed in this way.
III. As electronic commerce springs up, information security becomes more important. In addition to encryption techniques, people are increasingly concerned with hidden message authentication techniques.
The extensive applications of information hiding technology can be roughly categorized as follows:
Secret communications: hiding the communication process and the communicators.
Copyright protection: an authorized watermark is embedded imperceptibly in the multimedia.
Testing and certification: digital works can be certified and tested for tampering.
Piracy tracking: used to track the author or the buyers of copies.
Information identification: some information is hidden in the carrier medium in order to describe some elements of the medium.
Reproduction control and access control: embedded digital watermarks express access control rules.
Information control: using information hiding technology to control certain information.
Bills security: ensuring that the watermarks hidden on bills still exist after printing, which can guarantee the authenticity of the bills.
[Figure (1.4) A data hiding example [ROC08]; panels: the message to be hidden, the cover image, the produced stego image.]
1.6 Literature Survey
The following is a review of related works in this area:
I. Abdul Wahab, H. B., 2001, [ABD01] "Information Hiding in Written Text Using Context Free Grammar (CFG)". This work embedded text (English text), after constructing it according to a CFG, in another text (English text). The proposed system gives good results and can be applied in several real-life cases where sending an encrypted message would draw suspicion.
II. Al-Shamkhy, R. A., 2001, [ALS01] "Hiding Text in Text Using Dictionary Method". This thesis proposed a system that uses the text medium to embed its secret text file depending on a dictionary. The dictionary contains English words sorted in alphabetical order, to be selected by the user in order to build the cover message. The receiver does not need this dictionary, which decreases the amount of information needed on the receiver side and increases the security of the proposed system.
III. Al-Saady, B. Y., 2005, "Document Protection Using Digital Watermarking". In this thesis, four methods are suggested to embed a watermark in a document created by the Microsoft Word program. The two types of watermarking suggested are a visible watermark as a background and an invisible watermark that depends on the macro technique. The ability of a macro program to run with the document makes it possible to use the macro program to control the watermarking operation. Three methods are suggested that use the macro program as a tool to protect both watermark and document from unauthorized modification. These are powerful methods for protecting both watermark and document when applied to a Microsoft Word document.
IV. Al-Abaichi, A. M., 2005, "Analyzing and Detecting Information Hiding in Computer Printed Text". The proposed system analyzes and detects hidden information in printed text after converting it to a gray scale image, and consists of two phases, analysis and detection. In the first phase, the boundary of the text image, the baseline from two sides, and the beginning and ending of each line, word and character are fixed; the gaps between words and at the ends of lines are determined; and the numbers of lines, words, characters and gaps between words are calculated. The detection phase deals mainly with four methods used for hiding a secret message in formatted text: line-shift (up, down), open space (inter-word space, end-of-line space and inter-sentence space), word-shift (horizontal) and feature coding (shortening or lengthening the upward or downward strokes of a character).
V. Eckstein, K. and Jahnke, M., 2005, "Data Hiding in Journaling File Systems". This article structures and compares existing data hiding methods for UNIX file systems in terms of usability and countermeasures. It discusses variant techniques related to advanced file systems and proposes a new technique that stores substantial amounts of data inside journaling file systems in a robust fashion with low detectability, which is demonstrated by means of a proof-of-concept implementation for the ext3 journaling file system.
VI. Liu, T. Y., and Tsai, W. H., 2007, [LIU07] "A New Steganography Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique". This research proposed a hiding method in which text segments in the document are degenerated, mimicking the work of an author with inferior writing skills, with the secret message embedded in the choices of degenerations. The degenerations are then revised with the changes being tracked, making it appear as if a cautious author is correcting the mistakes.
1.7 Aim of Thesis
The aim of this thesis is to use information hiding technology to
embed text in the structure (binary file format) of a digital and printed
text document, namely a Microsoft Word 2003 document file, using a
steganographic method.
This can be achieved by the following:
The cover document, which is a Microsoft Word 2003 document, is made
to appear to be the product of a collaborative writing effort between
many authors, to avoid drawing suspicion that there is hidden data in
the document.
1.8 Thesis Outline
This thesis begins with an introduction to information hiding techniques
and their hierarchy.
Chapter Two: "Steganography" presents a general description of
steganography, text hiding methods, and Huffman encoding.
Chapter Three: "Microsoft Word Document File Format" gives a complete
description of Microsoft Word, the software, and its document file
format and structure.
Chapter Four: "Proposed Hiding System in Microsoft Compound Document
File Format" presents the cover generation process, MCDFF metadata, and
the hiding processes.
Chapter Five: "Experiment Results and Discussion" gives a complete
description of the implementation of the proposed method and its results.
Chapter Six: "Conclusions and Suggestions for Future Work" presents
the derived conclusions and some suggested ideas for future work.
Chapter Two
Steganography
2.1 Introduction
The word steganography comes from two roots in the Greek language:
"Stegos", meaning hidden, covered, or roof, and "Graphia", meaning
writing [KRE04].
The goal of steganography is to hide a message inside another harmless
message in a way that does not allow any enemy even to detect that a
second, secret message is present (to avoid drawing suspicion) [KAT00].
Steganography uses the illusion of normality to mask the existence of
covert activity. The illusion is manifested through the use of a myriad of
forms including written documents, photographs, paintings, music, sounds,
physical items, and even the human body. Two parts of the system are
required to accomplish the objective: successful masking of the message,
and keeping secret the key to its location and/or deciphering [DIC07].
2.2 Steganography Basic Model
[Figure: the message to hide and the cover enter the Embedding Process
together with the stego key, producing the stego cover; the Extracting
Process takes the stego cover and the stego key and returns the hidden
message and the cover.]
Figure (2.1) Steganography basic model
Each data hiding method consists of:
I. Embedding Process.
II. Extracting Process.
The Embedding Process is used to hide the secret message inside a cover
(or carrier). The cover carrier and the embedded message create a
stego-carrier.
The Extracting Process is used to extract the secret message from a
carrier.
Hiding information may require a stego-key or password, that is,
additional secret information, so that only those who possess the secret
keyword can access the hidden message.
Cover medium + Embedded message + Stego-key = Stego-medium
2.3 Steganography Types
There are basically three types of steganographic protocols, shown in
the following figure:
[Figure: Steganography branches into Pure Steganography, Secret Key
Steganography, and Public Key Steganography.]
Figure (2.2) Steganography Types
2.3.1 Pure Steganography [KAT00]
A steganography system which does not require the prior exchange of
some secret information (like a stego-key) is called pure steganography.
Both sender and receiver must have access to the embedding and
extracting algorithm.
Definition: (Pure steganography)
The quadruple б = < C, M, D, E >, where C is the set of possible covers,
M the set of secret messages with |C| ≥ |M|,
E: C × M → C the embedding function, and
D: C → M the extracting function,
with the property that D(E(c, m)) = m for all m ∈ M and c ∈ C, is called
a pure steganography system.
2.3.2 Secret Key Steganography
Secret key steganography is defined as a steganographic system
that requires the exchange of a secret key (stego-key) prior to
communication. Secret key steganography takes a cover message and
embeds the secret message inside it by using a secret key (stego-key). Only
the parties who know the secret key can reverse the process and read the
secret message. Unlike pure steganography, where a perceived invisible
communication channel is present, secret key steganography exchanges a
stego-key, which makes it more susceptible to interception. The benefit of
secret key steganography is that, even if the communication is
intercepted, only parties who know the secret key can extract the secret
message [DUN02].
Definition: (Secret Key Steganography)
The quintuple б = < C, M, K, Dk, Ek >, where C is the set of possible
covers, M the set of secret messages with |C| ≥ |M|, K the set of secret
keys,
Ek: C × M × K → C and
Dk: C × K → M,
with the property that Dk(Ek(c, m, k), k) = m
for all m ∈ M, c ∈ C and k ∈ K, is called a secret key steganographic
system [KAT00].
2.3.3 Public Key Steganography
As in public key cryptography, public key steganography does not
rely on the exchange of a secret key. A public key steganography system
requires the use of two keys, one private and one public. The public key,
which is stored in a public database, is used in the embedding process,
whereas the private (secret) key is used to reconstruct the secret
message [KAT00].
2.4 Steganography Algorithms
Steganography algorithms are classified into five categories:
(1). Spatial domain based steganography;
(2). Transform domain based steganography;
(3). Document based steganography;
(4). File structure based steganography;
(5). Other categories.
2.4.1 Spatial Domain Based Steganography
Spatial steganography mainly includes LSB (Least Significant Bit)
steganography and the BPCS (Bit Plane Complexity Segmentation) algorithm.
The spatial methods are most frequently employed by steganography tools
because of their fine concealment, great capacity for hidden information,
and easy realization [MIN06].
LSB Replacement & Matching
Least Significant Bit (LSB) steganography replaces the least significant
bit in some bytes of the cover file to hide a sequence of bytes which
contains the hidden data. LSB steganography includes two schemes:
sequential embedding and scattered embedding. Taking images as an
example, sequential embedding replaces the pixels' LSBs with the message
bits one by one, in order. Scattered embedding scatters the message
randomly over the whole image, using a random sequence to control the
embedding places.
BPCS Steganography
In analogy to the bit-replacing approach of LSB steganography, BPCS
steganography hides secret data by block-replacing: each bit plane of
the image is segmented into pixel-blocks of the same size. The capacity
of BPCS can reach 50% of the cover image data. However, such large
capacity embedding has more influence on the image [MIN06].
2.4.2 Transform Domain Based Steganography [KAT00]
The LSB modification techniques are easy ways to embed
information, but they are highly vulnerable to even small cover
modifications. An attacker can simply apply signal processing techniques
in order to destroy the secret information entirely.
Transform domain methods hide messages in significant areas of the
cover image, which makes them more robust to attacks, such as
compression, cropping, and some image processing, than the LSB
approach. However, while they are more robust to various kinds of signal
processing, they remain imperceptible to the human sensory system.
Many transform domain variations exist. One method is to use the
discrete cosine transformation (DCT).
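The two LSB schemes of Section 2.4.1 can be illustrated with a minimal Python sketch. It is not taken from any particular tool: the function names, the use of a plain byte string as the cover, and the use of Python's `random` module seeded with the stego-key for scattered embedding are all assumptions made for illustration (that generator is not cryptographically secure). With `key=None` the sketch embeds sequentially; with a key it scatters the bits over the cover. Extraction followed by embedding returns the message, which is the property D(E(c, m)) = m of the definitions above.

```python
import random


def to_bits(data: bytes) -> list[int]:
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def from_bits(bits: list[int]) -> bytes:
    """Pack a list of bits (most significant bit first) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def positions(cover_len: int, n_bits: int, key=None) -> list[int]:
    """Cover positions to use: sequential if key is None, else key-seeded scatter."""
    if n_bits > cover_len:
        raise ValueError("cover too small for message")
    if key is None:
        return list(range(n_bits))
    return random.Random(key).sample(range(cover_len), n_bits)


def embed(cover: bytes, message: bytes, key=None) -> bytes:
    """Replace the least significant bit of the chosen cover bytes with message bits."""
    bits = to_bits(message)
    stego = bytearray(cover)
    for pos, bit in zip(positions(len(cover), len(bits), key), bits):
        stego[pos] = (stego[pos] & 0xFE) | bit
    return bytes(stego)


def extract(stego: bytes, n_bytes: int, key=None) -> bytes:
    """Read the LSBs back in the same order; the message length must be known."""
    bits = [stego[pos] & 1 for pos in positions(len(stego), n_bytes * 8, key)]
    return from_bits(bits)
```

Each cover byte changes by at most one, which is why the modification is hard to see, and also why any re-encoding of the cover that disturbs the low bits destroys the message.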
2.4.3 Document Based Steganography
Tools of this kind embed data in document files by adding tabs or
spaces to .txt or .doc files. One such steganographic tool is the
software called Snow. Snow embeds data in .txt files by adding tabs and
spaces at the end of text lines. Every 3 bits are encoded with 0 to 7
spaces, and the runs of spaces are separated by a tab. The number of
secret bits should therefore be a multiple of 3; otherwise they are
padded with 0 bits.
2.4.4 File Structure Based Steganography
Structural embedding inserts secret data in the redundant bits of the
cover file, such as the reserved bits in the file header or the marker
segments in the file format. This makes hidden data immune to
visual/aural attack and to statistical detection [MIN06].
2.5 Steganography under Various Media
The onset of computer technology and the Internet has given new life to
steganography and the creative methods with which it is employed.
Computer-based steganographic techniques introduce changes to digital
carriers to embed information foreign to the native carriers [JOH01].
Carriers of such messages may resemble innocent-sounding text, disks
and storage devices, network traffic and protocols, the way software or
circuits are arranged, audio, images, video, or any other digitally
represented code or transmission [JOH01].
2.5.1 Hiding in Disk Space [MIK07]
Another way to hide information relies on finding unused space that
is not readily apparent to an observer. Taking advantage of unused or
reserved space to hold covert information provides a means of hiding
information without perceptually degrading the carrier. The way operating
systems store files typically results in unused space that appears to be
allocated to files. Another method of hiding information in a file system
is to create a hidden partition. These partitions are not seen if the
system is started normally. However, in many cases, running a disk
configuration utility exposes the hidden partition. These concepts have
been expanded in a novel proposal of a steganographic file system. If the
user knows the file name and password, then access is granted to the
file; otherwise, no evidence of the hidden file exists in the system.
2.5.2 Hiding in Network Packets [JOH01]
Various network protocols have characteristics that can be used to hide
information. TCP/IP packets are used to transport information; an
uncountable number of packets are transmitted daily over the Internet. Any
of these packets can provide a covert communication channel. The packet
headers have unused space or other values that can be manipulated to hide
information. However, filters can be set to detect information in the
"unused" or reserved spaces. One way to circumvent this detection is to
take advantage of information in the headers that typically goes unchecked
by most systems. Such information includes the values for sequence and
identification numbers.
2.5.3 Hiding in Software and Circuitry
Data can also be hidden based on the physical arrangement of a
carrier. The arrangement itself may be an embedded signature that is
unique to the creator. An example of this is in the layout of code
distributed in a program or the layout of electronic circuits on a board.
This type of "marking" can be used to uniquely identify the design origin
and cannot be removed without significant change to the network [JOH01].
2.5.4 Hiding in Video
For video, a combination of sound and image techniques can be
used. This is due to the fact that video generally has separate inner files for
the video (consisting of many images) and the sound. So techniques can be
applied in both areas to hide data. Due to the size of video files, the scope
for adding lots of data is much greater and therefore the chance of hidden
data being detected is quite low [CUM04].
2.5.5 Hiding in Audio
Data hiding in audio signals is especially challenging, because the
Human Auditory System (HAS) operates over a wide dynamic range. To
put this in perspective, the HAS perceives over a range of power greater
than one million to one and a range of frequencies greater than one
thousand to one, making it extremely hard to add or remove data from the
original data structure. The only weakness in the HAS comes at trying to
differentiate sounds (loud sounds drown out quiet sounds), and this is what
must be exploited to encode secret messages in audio without being
detected [DUN02].
2.5.6 Hiding in Image
Given the proliferation of digital images, especially on the Internet,
and given the large amount of redundant bits present in the digital
representation of an image, images are the most popular cover objects for
steganography [MOR00].
Using image files as hosts for steganographic messages takes
advantage of the limited capabilities of the human visual system. Encoding
extra data in an image file changes pixels in the image, but these changes
would remain imperceptible to the human eye [BER05].
2.5.7 Hiding in Text
Written text can be used as a medium to transmit secret messages.
Only small amounts of data can be hidden when hiding data in text. Thus,
this method is known to have a low data rate.
An important point is that the embedding task in text requires
the interaction of the user and therefore cannot be automated, whereas
image and audio methods can embed the data directly and automatically
according to their algorithm.
2.6 Classification of Text Hiding Techniques
Steganography methods can encode the information directly in the
text or in the text format, as shown in figure (2.3).
I. Encoding Information Directly in the Text
Many ways have been proposed to hide information directly in text,
such as the Syntactic, Semantic, P. Wayner, Chapman, Translation, and
HTML methods.
Syntactic method: the structure of sentences is transformed
without significantly altering their meaning. This method utilizes
punctuation and diction [VIL06].
Example of using punctuation:
The phrases "bread, butter, and milk" and "bread, butter and milk" are
both considered correct usage of commas in a list, such that when the
comma appears before the "and" this represents a "1", and the second
phrase represents a "0" [ALS01].
Example of using diction and structure of the text:
The sentence "Before the night is over, I will finish" and
the sentence "I will finish before the night is over".
This method is more transparent than the punctuation method. When a verb
comes at the beginning of the sentence, this is encoded as a "1"; when an
adverbial comes at the beginning of the sentence, this is encoded as a
"0" [ALS01]. The expected data rate is only several bits per kilobyte of
text. The use of punctuation is noticeable to even a casual reader, and
changing the punctuation can affect the clarity and even the meaning of
the text; this can be considered a disadvantage of using punctuation.
Semantic Method
Words are replaced by their synonyms and/or sentences are
transformed via suppression or inclusion of noun phrase coreferences
[VIL06].
Example of using the semantic method
The word "big" could be considered primary and the word "large"
considered secondary. When decoding, primary words are read as ones and
secondary words as zeros [ALS01].
However, syntactic and semantic methods are not suitable for all types
of documents (e.g. contracts, identity documents, literary texts) and
need, in general, human supervision [VIL06].
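The primary/secondary synonym convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table is hypothetical (only the pair "big"/"large" comes from the text), words are compared exactly without regard to inflection or context, and the receiver is assumed to know how many bits were embedded.

```python
# Hypothetical synonym table of (primary, secondary) pairs. Only "big"/"large"
# comes from the text; the second pair is invented for illustration.
SYNONYMS = [("big", "large"), ("quick", "fast")]

PRIMARY_TO_SECONDARY = {p: s for p, s in SYNONYMS}
SECONDARY_TO_PRIMARY = {s: p for p, s in SYNONYMS}


def embed(cover_words, bits):
    """Choose the primary word for a 1 bit and the secondary word for a 0 bit."""
    queue = list(bits)
    out = []
    for word in cover_words:
        base = SECONDARY_TO_PRIMARY.get(word, word)  # normalise to primary form
        if base in PRIMARY_TO_SECONDARY and queue:
            bit = queue.pop(0)
            out.append(base if bit == 1 else PRIMARY_TO_SECONDARY[base])
        else:
            out.append(word)
    if queue:
        raise ValueError("cover text has too few substitutable words")
    return out


def extract(stego_words, n_bits):
    """Read primary words as ones and secondary words as zeros."""
    bits = []
    for word in stego_words:
        if word in PRIMARY_TO_SECONDARY:
            bits.append(1)
        elif word in SECONDARY_TO_PRIMARY:
            bits.append(0)
    return bits[:n_bits]
```

For instance, embedding the bits 1, 0, 0 into "a big dog and a quick cat saw a big bird" keeps the first "big", replaces "quick" with "fast", and replaces the second "big" with "large". A real system would need the human supervision mentioned above, since synonyms are rarely interchangeable in every context.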
P. Wayner Method
Peter Wayner proposed a mimic function, which exploits the
statistical profile of a message. Since the stego-objects are created only
according to the statistical profile, the semantic component is entirely
ignored.
Wayner described one of the most promising techniques: he uses a
context-free grammar (CFG) to create the cover text and chooses the
productions according to the secret message to be transmitted. The secret
information is not embedded in the cover; the cover itself is the secret
message. If the grammar is unambiguous, the receiver can extract the
information by applying standard parsing techniques [KAT00].
Wayner proposed an extension to the technique of mimic functions:
given a set of productions, a probability is assigned to each possible
production. The sender then constructs a Huffman compression function
and converts the secret message to a binary string. The receiver then
parses the cover in order to reconstruct the productions which have been
used in the embedding step; this can be accomplished by the use of a
parse tree for the given CFG [ALS01].
The weak point of this technique is that it is difficult to select
meaningful type categories without considering the eventual grammatical
requirements of a natural-language style source [ALD05].
Chapman and Davida Method
Chapman and Davida proposed a system which consists of two
functions, NICETEXT and SCRAMBLE. Given a large dictionary of
words of different types, and a style source, which describes how words of
different types can be used to form a meaningful sentence, NICETEXT
transforms secret message bits into sentences by selecting words out of
the dictionary which conform to a sentence structure given in the style
source [ALS01].
SCRAMBLE reconstructs the secret if the dictionary which has been used
is known. Style sources can either be created from natural-language
sentences or be generated using a CFG [ALS01].
The most obvious problem with the manual method is that it takes too long
to enter large lists. Nicetext focuses on creating large, sophisticated
dictionaries with thousands of words [ALD05].
Translation-Based Steganography
This method uses the expected errors in the translation process,
especially in machine translation, to solve the issue of producing
implausible text; information is hidden in the noise that occurs in
language translation. In cases where sending imperfect translations to a
[...] resulting from translation-based steganography are inconspicuous.
The translation-based approach, however, may be vulnerable to active
attacks [LIU07].
HTML
Information is hidden in HTML files by adding useless spaces and
line breaks or by changing the case of letters in the tags [JOH98].
HTML files are good candidates for including extra spaces, because Web
browsers ignore these "extra" spaces and they go unnoticed until the
source of the page is revealed [KAT00].
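The tag-case variant of HTML hiding can be sketched as follows. The convention used here (an upper-case tag name carries a 1, a lower-case one a 0) and the regular expression that locates tag names are assumptions made for illustration, not a scheme taken from [JOH98]; the sketch also assumes the page contains no literal "<" followed by a letter outside of tags, and that the receiver knows the number of embedded bits.

```python
import re

# Matches the opening "<" or "</" of a tag together with the tag name.
TAG = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def embed(html: str, bits: list[int]) -> str:
    """Set each tag name to upper case for a 1 bit and lower case for a 0 bit."""
    queue = list(bits)

    def recase(match):
        if not queue:
            return match.group(0)          # message exhausted: leave tag as is
        name = match.group(2)
        bit = queue.pop(0)
        return match.group(1) + (name.upper() if bit else name.lower())

    stego = TAG.sub(recase, html)
    if queue:
        raise ValueError("not enough tags to carry the message")
    return stego


def extract(html: str, n_bits: int) -> list[int]:
    """Read the case of the first n_bits tag names back as bits."""
    names = [m.group(2) for m in TAG.finditer(html)]
    return [1 if name.isupper() else 0 for name in names[:n_bits]]
```

Because HTML tag names are case-insensitive, a browser renders the cover and the stego page identically; the difference is visible only in the page source, and normalising the case of all tags removes the message.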
[Figure: Text Hiding Techniques divide into two branches.
Encoding information directly in the text: Semantic method, Syntax
method, P. Wayner method, Chapman and Davida method, Translation-based
steganography, HTML.
Encoding information in the text format: Feature encoding, Line-shift
encoding, Word-shift encoding, Open-space encoding, Color quantization,
Halftone quantization.
Each method leads to a binary code.]
Figure (2.3) Text hiding method
II. Encoding Information in the Text Format [ALS01]
Information can be embedded in the format rather than in the
message itself. Secret information can be stored in the size of inter-line
or inter-word spaces. If the space between two lines is smaller than
some threshold, a "0" is encoded; otherwise a "1" is encoded. Infrequent
additional white space characters are introduced to form the secret
message.
Open Space Method
This method encodes data through manipulation of white space (unused
space) on the printed page. There are three methods for using white space
to encode data.
Inter-Sentence Spacing [ALS01]
This method encodes a binary message into a text by
placing one or two spaces after each sentence, such that one space
represents "0" and two spaces represent "1".
The disadvantage of this method is that it is inefficient, requiring a
great deal of text to encode very few bits (one bit per sentence). This
equates to a data rate of approximately one bit per 160 bytes, assuming
sentences are on average two 80-character lines of text. Its ability to
encode also depends on the structure of the text, and many word processors
automatically set the number of spaces after periods to one or two
characters.
A. End-of-Line Spaces [ALS01]
This method inserts spaces at the end of lines. The data
are encoded by allowing for a predetermined number of spaces at the end
of each line. This method has the advantages that it goes
unnoticed by readers and that the amount of hidden information is greater
than with the inter-sentence method. The disadvantage is that some
programs, such as "sendmail", may inadvertently remove the extra space
characters.
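End-of-line encoding can be sketched in Python using the three-bits-per-run convention that Section 2.4.3 attributes to Snow: every group of 3 bits becomes a run of 0 to 7 spaces, and each run is preceded by a tab. The sketch follows only that description; it is not the file format of the actual Snow tool, and the function names, the bit-string representation, and the assumption that the cover line has no meaningful trailing white space are illustrative choices.

```python
def embed_line(line: str, bits: str) -> str:
    """Append trailing white space that carries `bits` (a string of '0'/'1')."""
    padded = bits + "0" * (-len(bits) % 3)            # pad to a multiple of 3 bits
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    # One tab, then 0..7 spaces, for every 3-bit group.
    trailer = "".join("\t" + " " * int(group, 2) for group in groups)
    return line.rstrip(" \t") + trailer


def extract_line(stego_line: str) -> str:
    """Recover the (padded) bit string from the trailing white space of a line."""
    body = stego_line.rstrip(" \t")
    trailer = stego_line[len(body):]
    runs = trailer.split("\t")[1:]                    # text before first tab is empty
    return "".join(format(len(run), "03b") for run in runs)
```

For example, the five bits 10111 are padded to 101110 and stored as a tab, five spaces, a tab, and six spaces. Any program that strips trailing white space removes the message, which is the weakness noted above.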
B. Inter-Word Spaces [ALS01]
This method uses white space between words to encode data and involves
right justification of text. One space between words is interpreted as a
"0"; two spaces between words are interpreted as a "1". The advantages of
this method are that changing the number of spaces has little
chance of changing the meaning of a phrase or sentence, and the casual
reader is unlikely to take notice of slight modifications in white space.
The disadvantage is that, even if the reader does not notice the
manipulation, a word processor may inadvertently change the number of
spaces, destroying the hidden data.
Line-Shift Coding
In this method, text lines are vertically shifted (moved up or down)
according to the secret message bits, whereas other lines are kept
stationary for the purpose of synchronization. If a line is moved up, a
"1" is encoded; otherwise a "0" is encoded [DUC01].
The disadvantages of this method are that it is the text coding technique
most visible to the reader, that even large documents encode only a few
bits (one bit per line), and that the need for the original message may
decrease the security of the system [ALS01].
Word-Shift Coding [ALS01]
In this method, codewords are coded into a document by shifting the
horizontal or vertical locations of words within text lines, while
maintaining a natural spacing appearance.
This method is only applicable to documents with variable spacing
between adjacent words.
As a result of this variable spacing, it is necessary to have the original
image, or at least to know the spacing between words in the unencoded
document.
A. Encode Codeword (Horizontal Shift- Word)
For each text line, the largest and the smallest spaces between words
are found. It is possible to alter every space between two words
[ALS01].
For example, take Sentence 1:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet
Applying a horizontal word-shifting algorithm yields the following
sentence, Sentence 2:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
Overlapping the two sentences gives the following:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and
security in the electronic web called the Internet.
This is achieved by expanding the space before "wide" and "web" by one
point and condensing the space after "explore" and "the world" by one
point in Sentence 1. The sentences containing the shifted words appear
harmless, but combining this with the original sentence produces a
different message: explore the world wide web.
The same method can encode a binary message instead of a
codeword. For example, expanding the space before "explore", "the world",
"wide", and "web" by one point is encoded as "1", and condensing
the space after "explore", "the world", "wide", and "web" by one point is
encoded as "0".
By applying random horizontal shifts to all words in the document, an
attacker could eliminate the encoding.
B. Encode Codeword (Vertical Shift- Word)
Shifting the vertical locations of words can be used to help identify
an original document. A similar method can be applied to display an
entirely different message [ALS01].
For example take the following sentence:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
Applying a vertical word-shifting algorithm yields the following
sentence:
We explore new steganographic and cryptographic algorithms and
techniques throughout the world to produce wide variety and security
in the electronic web called the Internet.
The same method can encode a binary message instead of a
codeword. For example, shifting up the words "explore" and "the world"
is encoded as "1", and shifting down the words "wide" and "web" is
encoded as "0".
Feature Encoding
In this method, features such as shape, size, or position are
manipulated: certain text features are altered, or not altered, depending
on the codeword. For example, one could encode bits into text by
extending or shortening the upward, vertical end lines of letters such as
b, d, h, etc. Generally, before encoding, feature randomization takes
place: character end line lengths are randomly lengthened or
shortened, then altered again to encode the specific data. This removes
the possibility of visual decoding, as the original end line lengths would
not be known; to decode, one requires the original image.
Examples of using feature coding:
A long d is decoded as "1"; a short d is decoded as "0".
A long h is decoded as "1"; a short h is decoded as "0".
A long b is decoded as "1"; a short b is decoded as "0".
This method has the advantages of a high amount of data
encoding and of being largely indiscernible to the reader; the
disadvantage is that feature coding can be defeated by adjusting each
end line length to a fixed value [ALS01].
Color quantization [VIL06]
The main idea of this method is to quantize the color or luminance
intensity of each character in such a manner that the human visual
system is not able to distinguish between the original and quantized
characters, while a specialized reader machine can easily do so.
An example illustrating this method is shown in Figure (2.4),
in which dark characters encode a 0 and light ones encode a 1. A
binary sequence can be sequentially embedded into the cover text.
Notice that the embedding rate is comparatively higher than the rate of
inter-line or inter-word space modulation methods.
VAMOS A TRABAJAR
(a)
VAMOS A TRABAJAR
0 1 0 1 1 0 0 1 0 0 0 1 0 1
(b)
Figure (2.4) .Color quantization: (a) original text; (b) marked text (exaggerated)
Halftone Quantization [VIL06]
This method relies on halftoning, a widely used printing technology
that enables continuous tone images to be printed with one color ink
(grayscale) or a few color inks (color). Here, the discussion is restricted to
black & white printers.
In order to simulate a given gray shade, a halftone printer uses a
halftone screen. This method exploits the fact that there exist several
possible choices for the halftone screen leading to the same gray shade.
Therefore, one can use this property to hide data in each text
character by using a different halftone screen according to the message m
that one wishes to embed. The major strength of this method is that all
characters in the stego text will have the same gray shade. This method is
intended mainly for printed documents.
(a) (b) (c)
Figure (2.5) Halftone quantization: (a) Original character; (b) marked character for m = 0;
(c) Marked character for m = 1.
2.7 Steganalysis
A goal of steganography is to avoid drawing suspicion to the
transmission of a hidden message. If suspicion is raised, this goal is
defeated. Steganalysis is the art of discovering and rendering useless
such covert messages [JOH01].
In other words, steganalysis attempts to detect the existence of hidden
information [ALS01].
The steganalyst is one who applies steganalysis in an attempt to detect the
existence of hidden information and/or render it useless. Two aspects of
steganalysis are the detection and the distortion of embedded messages.
Detection requires that the analyst observe various relationships between
combinations of cover, message, stego-media, and steganography tool.
Distortion attacks require that the analyst manipulate the stego-media to
render the embedded information useless or remove it altogether [ETT98].
2.8 Attacks Available to the Steganalyst
There are many possible situations which confront the steganalyst,
depending on what information is available. The different cases are shown
in table (2.1) [JAJ98]:
Table (2.1) Steganography Attacks
1- Stego-only attack: only the stego-object is available for analysis.
2- Known cover attack: the "original" cover-object and the stego-object
are both available.
3- Known message attack: at some point, the attacker may know the
hidden message. Analyzing the stego-object for patterns that correspond
to the hidden message may be beneficial for future attacks against that
system. Even with the message, this may be very difficult and may even
be considered equivalent to the stego-only attack.
4- Chosen stego attack: the steganography tool (algorithm) and the
stego-object are known.
5- Chosen message attack: the steganalyst generates a stego-object from
some steganography tool or algorithm using a chosen message. The goal
in this attack is to determine corresponding patterns in the stego-object
that may point to the use of specific steganography tools or algorithms.
6- Known stego attack: the steganography algorithm (tool) is known and
both the original and stego-objects are available.
2.9 Introduction to the Code [ABD01]
A code is nothing more than a set of strings over a certain alphabet. For
example, the set C = {0, 10, 110, 1110} is a code over the alphabet {0, 1}.
Codes are generally used to encode messages. For instance, one
may use the set C to encode the first four letters of the alphabet, as
follows:
a → 0
b → 10
c → 110
d → 1110
One can then encode words (or messages) built up from these letters. The
word "cab", for instance, is encoded as
cab → 110010
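The code C can be tried out with a short Python sketch. The function names are illustrative. Decoding works greedily from left to right, which is valid here because no codeword of C is a prefix of another.

```python
# The code C from the text, assigned to the first four letters of the alphabet.
CODE = {"a": "0", "b": "10", "c": "110", "d": "1110"}


def encode(word: str) -> str:
    """Concatenate the codewords of the letters of `word`."""
    return "".join(CODE[letter] for letter in word)


def decode(bits: str) -> str:
    """Split a bit string back into codewords, reading left to right."""
    inverse = {codeword: letter for letter, codeword in CODE.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a complete codeword has been read
            letters.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(letters)
```

Calling `encode("cab")` gives the string 110010 shown above, and `decode("110010")` returns "cab".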
2.10 Why Encode the Data [KUO70]
There are three reasons to encode data that is about to be transmitted
(through space, for instance) or stored (on a computer disk, for instance).
The first reason is efficiency. It clearly makes sense to compress data
as much as possible in order to save transmission time or storage space. In
fact, data compression is very big business in the computer world. The
second reason to encode data is error detection and/or correction.
The third reason is secrecy, so that unauthorized persons cannot read
the data.
In other words, the goals of encoding are efficiency, error correction,
and secrecy.
2.11 Huffman Coding
There are different ways of encoding data, and one of these ways is
Huffman coding [Web06].
In 1952, D. A. Huffman published a method for constructing highly
efficient instantaneous encoding schemes. This method is now known as
Huffman encoding [ROM96].
The idea behind Huffman coding is simply to use shorter bit patterns
for more common characters, and longer bit patterns for less common
characters [Web06].
The method starts by building a list of all the alphabet symbols in
descending order of their probabilities. It then constructs a tree with a
symbol at every leaf, from the bottom up. This is done in steps where, at
each step, the two symbols with the smallest probabilities are selected,
added to the top of the partial tree, deleted from the list, and replaced
with an auxiliary symbol representing both of them. When the list is
reduced to just one auxiliary symbol (representing the entire alphabet)
the tree is complete [SAL95].
An Example [Web06]
To encode the letters A (0.12), E (0.42), I (0.09), O (0.30), and U
(0.07), listed with their respective probabilities, go through the
following steps:
1. Consider each of the letters as a symbol with its respective
probability.
2. Find the two symbols with the smallest probabilities and
combine them into a new symbol containing both letters, adding
their probabilities.
(Note 1: There may be a choice between two symbols with the same
probability. In that case either symbol can be chosen; the final tree
and codes will be different, but the overall efficiency of the code
will be the same.)
(Note 2: Frequency counts or other values may be used instead of
probabilities.)
3. Repeat step 2 until there is only one symbol left, with a
probability of 1.
4. To see the code, redraw all the symbols in the form of a tree,
where each symbol either contains a single letter or splits up
into two smaller symbols. Label all the left branches of the
tree with a 0 and all the right branches with a 1. The code for
each of the letters is the sequence of 0's and 1's that leads to it
on the tree, starting from the symbol with a probability of 1.
Figure (2.6) Huffman Tree for example
5. Thus the codes for the letters are:
A = 100, E = 0, I = 1011, O = 11, U = 1010.
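The steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function name huffman_code, the use of a heap in place of a sorted list, and the rule that the symbol selected first goes on the left (0) branch are all assumptions made here. With that rule and the probabilities of the example, the sketch yields the same codes as step 5.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a Huffman code; returns a dictionary {symbol: bit string}."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(p, next(tie), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, first = heapq.heappop(heap)    # smallest probability
        p2, _, second = heapq.heappop(heap)   # second smallest probability
        # The pair becomes one auxiliary symbol with the summed probability.
        heapq.heappush(heap, (p1 + p2, next(tie), (first, second)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):        # auxiliary symbol: two branches
            walk(node[0], prefix + "0")    # left branch is labelled 0
            walk(node[1], prefix + "1")    # right branch is labelled 1
        else:                              # leaf: an original symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


example = {"A": 0.12, "E": 0.42, "I": 0.09, "O": 0.30, "U": 0.07}
for letter, bits in sorted(huffman_code(example).items()):
    print(letter, bits)
# A 100
# E 0
# I 1011
# O 11
# U 1010
```

As Note 1 points out, a different tie-breaking or branch-labelling rule would give different codes of the same overall efficiency.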
The Huffman code for the 26-letter alphabet [ROM96]
[Tree diagram: each leaf shows a letter with its probability and code word, the same values that are listed in Table (2.2).]
Figure (2.7) Huffman tree for the 26-letter Alphabet
Table (2.2) shows the letters of the alphabet with their approximate
probabilities of occurrence in English, based on statistical data. The
last column of the table shows the Huffman code of each letter. The
encoding scheme of Table (2.2) is the one used in this work [ROM96].
Table (2.2) Probabilities of Occurrence in English Text
Symbol | Probability | Huffman code
E | 0.1300 | 000
T | 0.0900 | 0010
A | 0.0800 | 0011
O | 0.0800 | 0100
N | 0.0700 | 0101
R | 0.0650 | 0110
I | 0.0650 | 0111
H | 0.0600 | 10000
S | 0.0600 | 10001
D | 0.0400 | 10010
L | 0.0350 | 10011
C | 0.0300 | 10100
U | 0.0300 | 10101
M | 0.0300 | 10110
F | 0.0200 | 10111
P | 0.0200 | 11000
Y | 0.0200 | 11001
B | 0.0150 | 11010
W | 0.0150 | 11011
G | 0.0150 | 11100
V | 0.0100 | 11101
J | 0.0050 | 111100
K | 0.0050 | 111101
X | 0.0050 | 111110
Q | 0.0025 | 1111110
Z | 0.0025 | 1111111
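As an illustration of how the codes of Table (2.2) can be applied, the following Python sketch converts text into a bit string and back. The function names are hypothetical, and silently skipping characters that are not letters is an assumption made only to keep the example short; it is not a description of the method of this work.

```python
# Huffman codes copied from Table (2.2).
TABLE_2_2 = {
    "E": "000", "T": "0010", "A": "0011", "O": "0100", "N": "0101",
    "R": "0110", "I": "0111", "H": "10000", "S": "10001", "D": "10010",
    "L": "10011", "C": "10100", "U": "10101", "M": "10110", "F": "10111",
    "P": "11000", "Y": "11001", "B": "11010", "W": "11011", "G": "11100",
    "V": "11101", "J": "111100", "K": "111101", "X": "111110",
    "Q": "1111110", "Z": "1111111",
}
INVERSE_2_2 = {bits: letter for letter, bits in TABLE_2_2.items()}


def encode_text(text):
    """Encode the letters of text; other characters are skipped (assumption)."""
    return "".join(TABLE_2_2[ch] for ch in text.upper() if ch in TABLE_2_2)


def decode_bits(bits):
    """Decode a bit string produced by encode_text back into capital letters."""
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in INVERSE_2_2:      # a complete code word has been read
            letters.append(INVERSE_2_2[current])
            current = ""
    if current:
        raise ValueError("bit string ends inside a code word")
    return "".join(letters)


print(encode_text("The"))           # 001010000000
print(decode_bits("001010000000"))  # THE
```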
Chapter Three Microsoft Word Document File Format
Chapter Three
Microsoft Word Document File
3.1 Introduction
Microsoft Word is word processing software. Many versions of Word
were written for several platforms¹, including the IBM PC running DOS, the
Apple Macintosh, and Microsoft Windows, as shown in Figure (3.1).
It is a component of the Microsoft Office System; Microsoft began
calling it Microsoft Office Word instead of merely Microsoft Word.
[Chart: Word version number against year of issue, 1983 to 2008, for MS-DOS, Macintosh, and Windows]
Figure (3.1) Word Versions for Different Operating Systems
¹ Platform: the underlying hardware or software for a system.
3.2 History of Word
Many concepts and ideas of Word were brought from Bravo, the
original GUI word processor developed at Xerox PARC¹ [Web08].
Bravo's creator, Charles Simonyi, left PARC to work for Microsoft in
1981. Simonyi hired Richard Brodie, who had worked with him on Bravo,
away from PARC that summer [Web02].
Word featured the concept of "What You See Is What You Get", or
WYSIWYG, and was the first application with such features as the ability
to display bold and italic text on an IBM PC. Word made full use of the
mouse, which was so unusual at the time that Microsoft offered a bundled
Word-with-Mouse package [Web08].
Although MS-DOS was a character-based system, Microsoft Word
was the first word processor for the IBM PC that showed actual line breaks
and typeface markups such as bold and italics directly on the screen while
editing. This was not a true WYSIWYG system, however, because available
displays did not have the resolution to show actual typefaces [Web02].
Word 97
Word 97 had the same general operating performance as later
versions such as Word 2000. This was the first version of Word featuring the
"Office Assistant"², which was an animated helper used in all Office
programs [Web08].
Word 2000
For most users, one of the most obvious changes introduced with
Word 2000 (and the rest of the Office 2000 suite) was a clipboard³ that
could hold multiple objects at once. Another noticeable change was that the
Office Assistant, whose frequent unsolicited appearance in Word 97 had
annoyed many users, was changed to be less intrusive [Web08].
¹ Xerox PARC: research and development company, 1970.
² Office Assistant: animated helper used in all Office programs.
³ Clipboard: a special file or memory area (buffer) where data is stored temporarily before being copied to another location; used for copy and paste.
Word 2002
Word 2002 was bundled with Office XP and was released in 2001.
Although its appearance was different, it had many of the same features as
Word 2003. One of the key advertising strategies for the software was the
removal of the Office Assistant in favor of a new help system, although it
was simply disabled by default in Word 2002 [Web08].
Word 2003
For the 2003 version, the Office programs, including Word, were
rebranded to emphasize the unity of the Office suite, so that Microsoft
Word officially became Microsoft Office Word. Users continue to use both
names [Web08].
Word 2007
The release includes numerous changes, including a new XML-based
file format, a redesigned interface, and an integrated equation editor
[Web08].
Word 2008
Word 2008 is the most recent version of Microsoft Word for the
Mac, released on January 15, 2008. It includes some new features from
Word 2007 [Web08].
3.3 Microsoft Word Document and its Components [Web11]
Documents in Word have a hierarchical structure, as shown in
Figure (3.2).
Figure (3.2) External Structure of a Word Document
Different types of properties apply to different units in the hierarchy:
Section. By default a document is a single section, but settings for
margins, headers and footers, footnotes, and columns apply to
whole sections, so a section break is needed to change any of these for
only part of a document. A new section is made using Insert | Break
and selecting one of the four types of "section breaks".
Paragraph. Most formatting in Word applies at the paragraph
level: indents, line spacing, default font properties, bullets, etc. Many
aspects of paragraph formatting can be applied all at once to a
paragraph using paragraph styles.
Character. Some formatting attributes apply at the level of the
individual character, such as the bold font in the first word of this
paragraph. A set of character attributes can be applied together using
character styles.
In addition to these parts of the main document, there are other special
kinds of text which Word refers to as "stories". These include
footnotes, comments, and headers and footers. These items are stored separately
from the main text and require special commands to access and edit.
Customizations. Customizations such as definitions, macros, and toolbars may
be stored either in the document or in the document's associated template.
Styles. Styles are collections of format specifications which can be applied
all together to a paragraph or a group of characters. The advantage of
using styles to apply formatting is that the formatting of all paragraphs
of a certain type (e.g. examples, section headings, or footnotes) can be
changed easily, simply by redefining the style. A linguistics
paper usually goes through a number of stages: as a term paper, as a
draft circulated for comments, as a conference handout, as a
journal submission, and as camera-ready copy for a volume. Each of
these stages has its own format requirements. Using styles right from
the beginning for all formatting can save a huge amount of time over
the life of a paper.
3.4 Annotation and Collaboration Tools [Web11]
A linguist will often work together with someone else on a
document, either as a co-author or in a student-teacher relationship.
Word has some easy-to-use tools to facilitate such collaborative work.
3.4.1 Track Changes
The "Track Changes" tool gives access to a simple method of keeping
track of the changes a particular user makes to a document. Insertions are
displayed in color and underlined; deletions and format changes are displayed
in bubbles like comments. An example of Track Changes is shown in
Figure (3.3) [Web11].
Track Changes is a way for Microsoft Word to keep track of the changes
made to a document. Track Changes is also known as redline, or
redlining, because some industries traditionally draw a vertical
red line in the margin to show that some text has changed [Web04].
Figure (3.3) Track change example
3.4.2 Comments
The "Comment" feature allows comments to be added to the
document. In Page Layout view, recent versions of Word display
comments in "bubbles" on the right side of the text (moving the text over to
make room in the margin for the comment). Comments from different
reviewers appear in different colors. An example of comments is shown in
Figure (3.4) [Web11].
Figure (3.4) Comments example
3.5 File Format [Web03]
A file format is a particular way to encode information for storage in
a computer file.
Since a disk drive, or indeed any computer storage, can store only bits,
the computer must have some way of converting information to 0s and 1s
and vice versa. There are different kinds of formats for different kinds of
information. Within any format type, e.g., word processor documents, there
will typically be several different formats. Sometimes these formats
compete with each other.
Some file formats are designed to store very particular sorts of data:
the JPEG format, for example, is designed only to store static photographic
images. Other file formats, however, are designed for storage of several
different types of data.
3.6 Identifying the Type of a File [Web03]
Since files are seen by programs as streams of data, a method is
required to determine the format of a particular file within the file system,
which is an example of metadata. Different operating systems have traditionally
taken different approaches to this problem, with each approach having its
own advantages and disadvantages, as follows.
3.6.1 Filename Extension
One popular method, in use by several operating systems including
DOS and Windows, is to determine the format of a file based on the section
of its name following the final period. This portion of the filename is
known as the filename extension. For example, HTML documents are
identified by names that end with .html (or .htm) [Web03].
3.6.2 Magic Number
An alternative method, often associated with UNIX and its
derivatives, is to store a "magic number" inside the file itself. Originally,
this term was used for a specific set of 2-byte identifiers at the beginning of
a file, but since any undecoded binary sequence can be regarded as a
number, any feature of a file format which uniquely distinguishes it can be
used for identification. GIF images, for instance, always begin with the
ASCII representation of either GIF87a or GIF89a, depending upon the
standard to which they adhere [Web03].
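Both identification methods can be illustrated with a short Python sketch. The function names and the small lookup tables are assumptions made for this example. The two GIF signatures are the ones mentioned above, and the 8-byte compound document identifier is the one listed later in Table (3.2).

```python
import os

# Hypothetical lookup tables, kept deliberately small for the example.
EXTENSIONS = {".html": "HTML document", ".htm": "HTML document"}
MAGIC_NUMBERS = {
    b"GIF87a": "GIF image (GIF87a)",
    b"GIF89a": "GIF image (GIF89a)",
    bytes.fromhex("D0CF11E0A1B11AE1"): "compound document file",
}


def identify_by_extension(filename):
    """Use the part of the name that follows the final period."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


def identify_by_magic(data):
    """Compare the first bytes of the file content with known signatures."""
    for magic, description in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return description
    return None


print(identify_by_extension("index.html"))              # HTML document
print(identify_by_magic(b"GIF89a" + b"\x00" * 10))       # GIF image (GIF89a)
```

The extension is only a hint stored in the name and can be changed freely, whereas the magic number is part of the content, which is why the second check is usually the more reliable of the two.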
3.7 File Structure
Each format uses a structure (a way to organize data for storing) in a file
[FOL98].
There are several ways to structure data in a file. The most
usual ones are shown in Figure (3.5).
File structure types:
Raw memory dumps (RMD)
Chunk based format (CBF)
Directory based format (DBF)
Figure (3.5) File Structure Types
3.7.1 Raw Memory Dumps/Unstructured Formats (RMD) [Web03]
Earlier file formats used raw data formats that consisted of directly
dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have
reserved spaces for future extensions, extending and improving this type of
file is very difficult. On the other hand, developing tools for
reading and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of
other types of file formats that could be easily extended and be backward
compatible at the same time.
3.7.2 Chunk based Formats (CBF) [Web03]
In this kind of file structure, each piece of data is embedded in a
container that contains a signature identifying the data, as well as the length of
the data (for binary encoded files). This type of container is called a chunk.
The signature is usually called a chunk id, chunk identifier, or tag
identifier.
With this type of file structure, tools that do not know certain chunk
identifiers simply skip those that they do not understand. Even XML can be
considered a kind of chunk based format, since each data element is
surrounded by tags which are akin to chunk identifiers.
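The skipping behaviour can be sketched in Python under an assumed, purely hypothetical chunk layout: a 4-byte identifier, a 4-byte Little-Endian length, and then the data. Real chunk based formats differ in these details, and the chunk names used below are invented for the example.

```python
import struct


def make_chunk(chunk_id, data):
    """Build one chunk: 4-byte id, 4-byte Little-Endian length, then the data."""
    return chunk_id + struct.pack("<I", len(data)) + data


def read_chunks(blob, known_ids):
    """Return the data of the known chunks and skip every other chunk."""
    position, found = 0, {}
    while position + 8 <= len(blob):
        chunk_id = blob[position:position + 4]
        (length,) = struct.unpack_from("<I", blob, position + 4)
        if chunk_id in known_ids:
            found[chunk_id] = blob[position + 8:position + 8 + length]
        # The stored length is enough to jump over a chunk that is not understood.
        position += 8 + length
    return found


blob = (make_chunk(b"TEXT", b"hello")
        + make_chunk(b"XTRA", bytes(10))   # unknown to the reader below
        + make_chunk(b"AUTH", b"me"))
print(read_chunks(blob, {b"TEXT", b"AUTH"}))
# {b'TEXT': b'hello', b'AUTH': b'me'}
```

Because every chunk carries its own length, a newer writer can add chunk types without breaking an older reader, which is the extensibility the section describes.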
3.7.3 Directory based Formats (DBF) [Web03]
This is another extensible format that closely resembles a file system
(OLE documents are actual file systems). The file is composed of
'directory entries' that contain the location of the data within the file itself
as well as its signature (and in certain cases its type). Good examples of
this type of file structure are disk images and OLE documents [Web03].
3.8 Structure Storage
The lowest level of organization that is normally imposed on a file is a
stream of bytes.
When data is stored in a file merely as a stream of bytes, the ability to
distinguish among the fundamental information units of the data is lost.
These fundamental pieces of information are called fields. Fields are
grouped together to form records, and records are grouped together to form
blocks [FOL98], as shown in Figure (3.6).
[Diagram: a block contains records, a record contains fields, and a field is a stream of bits (0, 1)]
Figure (3.6) Logic view of file
In persistent storage, files are normally stored in the form of bytes. A file is
treated as a raw sequence of bytes. The entire file is stored in blocks on
the disk, and these blocks are scattered over the disk. When the file is read, the
file system manages its pointers and returns a sequence of bytes [CHA00].
Structure storage follows a different approach to storing a file and its data on
persistent storage. It defines how to
treat a file as a structured collection of objects. These objects are storages
and streams, as shown in Figure (3.7).
[Diagram: a root containing storages and streams; storages in turn contain further storages and streams]
Figure (3.7) Storage and Stream Structure
A storage object is a kind of directory: it can contain other storage
objects and stream objects. A stream object can be thought of as a
file. Like a file, a stream contains data stored as a consecutive sequence of
bytes. A compound file is a combination of these two kinds of object [CHA00].
A compound file is a file which contains different types of data saved in a
structured format. Consider a compound file which has some text, some
images, and other data, to which one more object is to be added. In the
traditional approach, when saving a file, the file system rewrites the entire
data. The structured storage approach eliminates this rewriting process
and increases the read/write performance. The new data is written to the
next available location in permanent storage, and the storage object updates
the table of pointers it maintains to track the locations of its storage objects
and stream objects [CHA00].
Here are some other benefits:
The structured storage approach provides control over separate
objects. It can read/write separate objects instead of the entire
compound file [CHA00].
More than one user can concurrently read/write the same file
[CHA00].
3.9 Microsoft Compound Document File Format (MCDFF)
A Word file that contains an Excel sheet and chart, an image, a table, and
some macros is an example of a compound file.
Files which use the MCDFF (Microsoft Compound Document File
Format) include output files from MS Office 97-2003, which consists of
applications like MS Word, PowerPoint, and Excel [CHA00].
The Microsoft Compound Document File Format (MCDFF) 2003 is a
document file format based on OLE (Object Linking and Embedding),
which is used for saving various resources as an integrated document in
Microsoft applications [MIC07].
A storage component may exist as a standalone component. Each
storage component may have one or more sub-storage components and
stream components. The root component may also have stream components
directly within it [JIT06].
3.10 Structure of Word Document Files
Figure (3.8) shows the structure of a Word document with an embedded
Excel object.
[Diagram: the root component 'MS Word' holds the streams Word Document, Data, Table, Summary Information, Document Summary Information, CompObj, and JPEG Image, and the storage Object Pool; the Object Pool holds the storage Excel Sheet with its streams Workbook, SummaryInformation, and DocumentSummaryInformation]
Figure (3.8) Sample of Word document storage format
The binary format for Microsoft Word 97 and later versions is based on
a structure referred to as a .doc file or compound file.
A Word .doc file consists of [MIC07]:
I. Word Document (main stream)
II. Summary information stream
III. Table stream
IV. Data stream
V. Custom XML storage (added in Word 2007)
VI. Zero or more object streams which contain private data for OLE 2.0
objects embedded within the Word document [MIC07].
The 'MS Word' component is the root component, containing several
streams and one storage item. Different parts of the document, such as the
actual contents, any table inserted, the CompObj associated with the DLL
files for the objects, the Summary Information for the content, any image
inserted, and the Document Summary Information, all take the form of
streams under the root component. The ObjectPool is the collective storage
of all the sub-storage components. Figure (3.8) displays a sample
sub-storage Excel component. The Excel Sheet itself is a storage component
within the ObjectPool and has its own streams of information: the
Workbook, SummaryInformation, and DocumentSummaryInformation
[JIT06].
Custom XML data store (added in Word 2007): The custom XML
data store specifies custom-defined XML files contained in the
binary Microsoft Word 97 format or the Office Open XML formats
[MIC07].
Data stream: The stream within a Word .doc file that contains
various data that anchor to characters in the main stream, for
example binary data describing in-line pictures and/or form fields
[MIC07].
Main stream: The stream within a Word .doc file that contains the
bulk of Word's binary data [MIC07].
Object storage: A storage that contains binary data for an embedded
OLE 2.0 object. Multiple instances are referred to as storages
[MIC07].
Stream: The physical encoding of a Word document's text and sub
data structures in a random access stream within a .doc file [MIC07].
Summary information stream: The stream within a Word .doc file
that contains the document summary information [MIC07].
Table stream: The stream within a Word .doc file that contains the
various plcfs and tables that describe a document's structures
[MIC07].
3.11 Format of the Main Stream
The main stream of a Word binary file (complex format) consists of
the Word file header (FIB), the text, and the formatting information.
FIB (File Information Block)
The header of a Word file begins at offset 0 in the file. It gives
the beginning offsets and lengths of the document's text stream and
subsidiary data structures within the file. It also stores other file status
information.
The FIB contains a "magic number" and pointers to the various
other parts of the file, as well as information about the length of the file.
The FIB is defined in the structure definition section of the
specification [MIC07].
Text
The text part contains all the text of the document (including footnotes,
header and footer lines, etc.). The document's text is also located in the main
stream [DIA08].
Word has used this same file format since its first version. This means
that Word 1.0 can read Word 5.0 files and vice versa. This compatibility
was accomplished by defining all structures to be larger than they needed
to be and setting all reserved fields to zero for use in future versions.
Reserved pointers in the document header have been used to add entirely
new document sections (such as document retrieval information and
bookmark tables) [Web09].
Because of the important issue of compatibility with future versions, all
fields in all structures which are not currently being used MUST be filled
with zeros. When the fields are finally defined for a new feature, they will
either make zero the default value of those fields or make zero represent an
uninitialized state which will be ignored [Web09].
3.12 MCDFF Metadata
MCDFF uses metadata to manage information about streams and
storages. Table (3.1) describes the type of information contained in each
kind of metadata in MCDFF [HYU08].
Table (3.1) MCDFF Metadata
Name of metadata | Information contained
Header | Signature, pointer table of BAT
BAT | Block Allocation Table
SBAT | Small Block Allocation Table
Directory | Stream and storage information
The exact format structure of these metadata is provided by the
OpenOffice.org documentation of the Microsoft Compound Document File
Format (Spreadsheet Project) [DAN07] and by the POIFS project of
Apache.org [MAR07]. POIFS files are called "file systems" because they
contain multiple embedded files in a manner similar to traditional file
systems. For example, a word processor file with the extension ".doc" is
actually a POIFS file system with a document file archived inside that
file system [MAR07].
Most operating systems, including Microsoft Windows, manage hard disk
drives by dividing their storage space into units known as partitions.
Before data can be stored on a partition, the partition must be formatted.
Formatting a partition organizes the associated space into what is called a
filesystem, which provides space for storing the names and attributes of
files as well as the data they contain. Microsoft Windows supports
several types of filesystems, such as FAT and FAT32. Formatting a disk
divides the disk into tracks, and each track is divided into sectors,
sometimes called disk blocks, as shown in Figure (3.9). Partitions
comprise the logical structure of a disk drive, the way humans and most
computer programs understand the structure. However, disk drives have
an underlying physical structure that more closely resembles the actual
structure of the hardware.
Figure (3.9) the structure of a hard disk [MCC99]
MCDFF uses two types of data unit: Small Block (Sector) and Big Block
(Block) [HYU08].
If the stream size is less than 4096 bytes, the file is stored in small blocks and
the SBAT is used to walk the small blocks (sectors) making up the file.
If the file size is 4096 bytes or larger, the file is stored in big blocks (blocks)
and the main BAT is used to walk the big blocks making up the file
[MAR07].
The (zero-based) index of a sector is called the sector identifier (SecID).
SecIDs are signed 32-bit integer values. If a SecID is not negative, it must
refer to an existing sector. If a SecID is negative, it has a special meaning [DAN07]:
SecID | Name | Meaning
-1 | Free SecID | Free sector, may exist in the file, but is not part of any stream
-2 | End Of Chain SecID | Trailing SecID in a SecID chain
-3 | SAT SecID | Sector is used by the sector allocation table
-4 | MSAT SecID | Sector is used by the master sector allocation table
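The role of the special SecIDs can be illustrated by walking the chain of sectors that make up one stream. The Python sketch below assumes that entry i of the allocation table holds the SecID of the sector that follows sector i in its chain; this reading of the table, the function name, and the sample values are assumptions made for the example.

```python
# Special SecID values as listed above.
FREE_SECID = -1
END_OF_CHAIN_SECID = -2
SAT_SECID = -3
MSAT_SECID = -4


def sector_chain(sat, first_secid):
    """Return the SecIDs of one stream, in order, starting at first_secid.

    sat[i] is assumed to hold the SecID of the sector following sector i.
    """
    chain, secid = [], first_secid
    while secid != END_OF_CHAIN_SECID:
        if secid < 0 or secid >= len(sat):
            raise ValueError(f"SecID {secid} does not refer to an existing sector")
        if secid in chain:
            raise ValueError("the chain loops back on itself")
        chain.append(secid)
        secid = sat[secid]
    return chain


# Sample table: sectors 0 -> 1 -> 4 form a stream, 2 is free, 3 holds the SAT.
sample_sat = [1, 4, FREE_SECID, SAT_SECID, END_OF_CHAIN_SECID]
print(sector_chain(sample_sat, 0))  # [0, 1, 4]
```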
3.12.1 Compound Document Header
The compound document header (simply "header" in the
following) contains all data needed to start reading a compound
document file. The header is always located at the beginning of the file
and is 512 bytes long; this implies that the first sector (with SecID 0)
always starts at file offset 512. The first 64 bits of the header form the
identifier (magic number) of an Office file.
The header also contains an array of block numbers. These block
numbers refer to blocks in the file. When these blocks are read together,
they form the Block Allocation Table. The header also contains a pointer
to the first element in the property table, also known as the root element,
and a pointer to the Small Block Allocation Table (SBAT) [MAR07].
The Block Allocation Table (BAT), along with the property table,
specifies which blocks in the file system belong to which files [MAR07].
The contents of the compound document header structure are
described in Table (3.2).
Table (3.2) Compound document header structure [DAN07]
Offset | Size | Contents
0 | 8 | Compound document file identifier: D0 CF 11 E0 A1 B1 1A E1
8 | 16 | Unique identifier (UID) of this file
24 | 2 | Revision number of the file format (most used is 003E)
26 | 2 | Version number of the file format (most used is 0003)
28 | 2 | Byte order identifier: FEH FFH = Little-Endian, FFH FEH = Big-Endian
30 | 2 | Size of a sector in the compound document file in power-of-two (ssz); real sector size is sec_size = 2^ssz bytes (minimum value is 7, which means 128 bytes; most used value is 9, which means 512 bytes)
32 | 2 | Size of a short-sector in the short-stream container stream in power-of-two (sssz); real short-sector size is short_sec_size = 2^sssz bytes (maximum value is the sector size ssz, see above; most used value is 6, which means 64 bytes)
34 | 10 | Not used
44 | 4 | Total number of sectors used for the sector allocation table
48 | 4 | SecID of first sector of the directory stream
52 | 4 | Not used
56 | 4 | Minimum size of a standard stream (in bytes; minimum allowed and most used size is 4096 bytes); streams with an actual size smaller than (and not equal to) this value are stored as short-streams
60 | 4 | SecID of first sector of the short-sector allocation table, or -2 (End Of Chain SecID) if not extant
64 | 4 | Total number of sectors used for the short-sector allocation table
68 | 4 | SecID of first sector of the master sector allocation table, or -2 (End Of Chain SecID) if no additional sectors used
72 | 4 | Total number of sectors used for the master sector allocation table
76 | 436 | First part of the master sector allocation table containing 109 SecIDs
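The layout of Table (3.2) maps directly onto a fixed binary record, which the following Python sketch decodes with the standard struct module. It is an illustration only: the function name, the dictionary keys, and the choice to treat counts and sizes as unsigned and SecIDs as signed are assumptions, and the sample header is synthetic rather than taken from a real document.

```python
import struct

# Field layout of Table (3.2); "<" selects Little-Endian without padding.
HEADER_FORMAT = struct.Struct("<8s16sHH2sHH10xIi4xIiIiI109i")
FILE_ID = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_header(data):
    """Decode the 512-byte compound document header into a dictionary."""
    fields = HEADER_FORMAT.unpack_from(data, 0)
    if fields[0] != FILE_ID:
        raise ValueError("not a compound document file")
    if fields[4] != b"\xFE\xFF":
        raise ValueError("this sketch only handles Little-Endian files")
    return {
        "revision": fields[2],
        "version": fields[3],
        "sec_size": 1 << fields[5],          # 2 ** ssz
        "short_sec_size": 1 << fields[6],    # 2 ** sssz
        "sat_sector_count": fields[7],
        "directory_first_secid": fields[8],
        "min_standard_stream_size": fields[9],
        "ssat_first_secid": fields[10],
        "ssat_sector_count": fields[11],
        "msat_first_secid": fields[12],
        "msat_sector_count": fields[13],
        # Unused slots of the 109-entry array hold -1 (Free SecID).
        "msat": [secid for secid in fields[14:] if secid >= 0],
    }


# Synthetic header built from the "most used" values of Table (3.2).
sample = HEADER_FORMAT.pack(
    FILE_ID, bytes(16), 0x003E, 0x0003, b"\xFE\xFF", 9, 6,
    1, 1, 4096, -2, 0, -2, 0, *([0] + [-1] * 108))
header = parse_header(sample)
print(len(sample), header["sec_size"], header["short_sec_size"])  # 512 512 64
```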
The header format structure in Table (3.3) is used to give block
information if the file is stored in blocks.
Note: The shaded cells in Table (3.3) are used in this work.
Table (3.3) Header (block 1) -- 512 (0x200) bytes [MAR07]
Field | Description | Offset | Length | Default value or const
FILETYPE | Magic number identifying this as a POIFS file system | 0x0000 | Long | 0xE11AB1A1E011CFD0
UK1 | Unknown constant | 0x0008 | Integer | 0
UK2 | Unknown constant | 0x000C | Integer | 0
UK3 | Unknown constant | 0x0014 | Integer | 0
UK4 | Unknown constant (revision?) | 0x0018 | Short | 0x003B
UK5 | Unknown constant (version?) | 0x001A | Short | 0x0003
UK6 | Unknown constant | 0x001C | Short | -2
LOG_2_BIG_BLOCK_SIZE | Log, base 2, of the big block size | 0x001E | Short | 9 (2 ^ 9 = 512 bytes)
LOG_2_SMALL_BLOCK_SIZE | Log, base 2, of the small block size | 0x0020 | Integer | 6 (2 ^ 6 = 64 bytes)
UK7 | Unknown constant | 0x0024 | Integer | 0
UK8 | Unknown constant | 0x0028 | Integer | 0
BAT_COUNT | Number of elements in the BAT array | 0x002C | Integer | required
PROPERTIES_START | Block index of the first block of the property table | 0x0030 | Integer | required
UK9 | Unknown constant | 0x0034 | Integer | 0
UK10 | Unknown constant | 0x0038 | Integer | 0x00001000
SBAT_START | Block index of first big block containing the small block allocation table (SBAT) | 0x003C | Integer | -2
SBAT_Block_Count | Number of big blocks holding the SBAT | 0x0040 | Integer | 1
XBAT_START | Block index of the first block in the Extended Block Allocation Table (XBAT) | 0x0044 | Integer | -2
XBAT_COUNT | Number of elements in the Extended Block Allocation Table (to be added to the BAT) | 0x0048 | Integer | 0
BAT_ARRAY | Array of block indices constituting the Block Allocation Table (BAT) | 0x004C, 0x0050, 0x0054 ... 0x01FC | Integer[ ] | -1 for unused elements, at least first element must be filled
N/A | Header block data not otherwise described in this table | N/A | N/A | -1
3.12.2 Byte Order [DAN07]
All data items containing more than one byte may be stored
using the Little-Endian or Big-Endian method, but in real world
applications only the Little-Endian method is used. The Little-
Endian method stores the least significant byte first and the most
significant byte last. This applies to all data types like 16-bit
integers, 32-bit integers, and Unicode characters.
Example: The 32-bit integer value 13579BDFH is converted
into the Little-Endian byte sequence DFH 9BH 57H 13H, or
to the Big-Endian byte sequence 13H 57H 9BH DFH.
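The example can be reproduced with Python's built-in integer conversions. This is a small illustrative sketch; the helper name as_hex is chosen here.

```python
value = 0x13579BDF


def as_hex(data):
    """Format a byte sequence as space-separated two-digit hex values."""
    return " ".join(f"{byte:02X}" for byte in data)


little = value.to_bytes(4, "little")  # least significant byte first
big = value.to_bytes(4, "big")        # most significant byte first

print(as_hex(little))  # DF 9B 57 13
print(as_hex(big))     # 13 57 9B DF

# Reading the bytes back with the matching byte order restores the value.
assert int.from_bytes(little, "little") == value
assert int.from_bytes(big, "big") == value
```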
3.12.3 Sector File Offsets [DAN07]
With the values from the header it is possible to calculate a file
offset from a SecID:
$\mathrm{sec\_pos}(\mathrm{SecID}) = 512 + \mathrm{SecID} \cdot \mathrm{sec\_size} = 512 + \mathrm{SecID} \cdot 2^{ssz}$    (3.1)
Example with ssz = 10 and SecID = 5:
$\mathrm{sec\_pos}(5) = 512 + 5 \cdot 2^{10} = 512 + 5 \cdot 1024 = 5632$
Note: The previous equation is used to calculate the block position too.
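Equation (3.1) translates into a one-line function. The Python sketch below is illustrative; the name HEADER_SIZE and the rejection of negative (special) SecIDs are choices made for the example.

```python
HEADER_SIZE = 512  # the header occupies the first 512 bytes of the file


def sec_pos(sec_id, ssz):
    """File offset of a sector according to equation (3.1)."""
    if sec_id < 0:
        raise ValueError("special (negative) SecIDs have no file offset")
    return HEADER_SIZE + sec_id * (1 << ssz)  # 1 << ssz equals 2 ** ssz


print(sec_pos(5, 10))  # 5632, the worked example above
print(sec_pos(0, 9))   # 512, the first sector directly follows the header
```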
3.12.4 Property Table (Directory)
The Property Table is essentially nothing more than the directory
system. Properties (directory entries) are 128-byte records contained within the
512-byte blocks. Each directory entry refers to a storage or a stream in the
compound document. The zero-based index of a directory entry is called the
directory entry identifier (DirID). There is a special directory entry at the
beginning of the directory (with the DirID 0). It represents the root
storage and is called the root storage entry [DAN07]. The contents of the
directory entry structure are described in Table (3.4).
Table (3.4) Directory entry structure [DAN07]
Offset | Size | Contents
0 | 64 | Character array of the name of the entry, always 16-bit Unicode characters, with trailing zero character (results in a maximum name length of 31 characters)
64 | 2 | Size of the used area of the character buffer of the name (not character count), including the trailing zero character (e.g. 12 for a name with 5 characters: (5+1)·2 = 12)
66 | 1 | Type of the entry: 00H = Empty, 01H = User storage, 02H = User stream, 03H = LockBytes (unknown), 04H = Property (unknown), 05H = Root storage
67 | 1 | Node colour of the entry: 00H = Red, 01H = Black
68 | 4 | DirID of the left child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no left child
72 | 4 | DirID of the right child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no right child
76 | 4 | DirID of the root node entry of the red-black tree of all storage members (if this entry is a storage), -1 otherwise
80 | 16 | Unique identifier, if this is a storage (not of interest in the following, may be all 0)
96 | 4 | User flags (not of interest in the following, may be all 0)
100 | 8 | Time stamp of creation of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
108 | 8 | Time stamp of last modification of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
116 | 4 | SecID of first sector or short-sector, if this entry refers to a stream; SecID of first sector of the short-stream container stream, if this is the root storage entry; 0 otherwise
120 | 4 | Total stream size in bytes, if this entry refers to a stream; total size of the short-stream container stream, if this is the root storage entry; 0 otherwise
124 | 4 | Not used
The property format structure in Table (3.5) is used to give
block information if the file is stored in blocks.
Note: The shaded cells in Table (3.5) are used in this work.
Table (3.5) Property -- 128 (0x80) byte block [MAR07]
| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| NAME | A unicode null-terminated uncompressed 16bit string (lose the high bytes) containing the name of the property. | 0x00, 0x02, 0x04, ... 0x3E | Short[] | 0x0000 for unused elements, field required, 32 (0x40) element max |
| NAME_SIZE | Number of characters in the NAME field | 0x40 | Short | Required |
| PROPERTY_TYPE | Property type (directory, file, or root) | 0x42 | Byte | 1 (directory), 2 (file), or 5 (root entry) |
| NODE_COLOR | Node color | 0x43 | Byte | 0 (red) or 1 (black) |
| PREVIOUS_PROP | Previous property index | 0x44 | Integer | -1 |
| NEXT_PROP | Next property index | 0x48 | Integer | -1 |
| CHILD_PROP | First child property index | 0x4C | Integer | -1 |
| SECONDS_1 | Seconds component of the created timestamp? | 0x64 | Integer | 0 |
| DAYS_1 | Days component of the created timestamp? | 0x68 | Integer | 0 |
| SECONDS_2 | Seconds component of the modified timestamp? | 0x6C | Integer | 0 |
| DAYS_2 | Days component of the modified timestamp? | 0x70 | Integer | 0 |
| START_BLOCK | Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT | 0x74 | Integer | Required |
| SIZE | Actual size of the file this property points to. (Used to truncate the blocks to the real size). | 0x78 | Integer | 0 |
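The field offsets in Tables (3.4) and (3.5) can be turned into a small parser. The following is a minimal sketch, not the thesis implementation: the `DirEntry` type and `parse_dir_entry` function are hypothetical names, and only the fields used later in this work are decoded. The sample record is synthetic and merely mimics a root entry.

```python
import struct
from typing import NamedTuple


class DirEntry(NamedTuple):
    name: str
    entry_type: int   # 1 = storage, 2 = stream, 5 = root
    colour: int       # 0 = red, 1 = black
    left: int         # DirID of left child, -1 if none
    right: int        # DirID of right child, -1 if none
    child: int        # DirID of first child, -1 if none
    start: int        # starting block (offset 0x74)
    size: int         # stream size in bytes (offset 0x78)


def parse_dir_entry(raw: bytes) -> DirEntry:
    if len(raw) != 128:
        raise ValueError("a directory entry is exactly 128 bytes")
    # Offset 64: used bytes of the name buffer, including the trailing zero
    name_bytes, = struct.unpack_from("<H", raw, 64)
    name = raw[:max(name_bytes - 2, 0)].decode("utf-16-le")
    entry_type, colour, left, right, child = struct.unpack_from("<BBiii", raw, 66)
    start, size = struct.unpack_from("<iI", raw, 116)
    return DirEntry(name, entry_type, colour, left, right, child, start, size)


# Synthetic 128-byte record resembling a root entry
raw = bytearray(128)
encoded = "Root Entry".encode("utf-16-le")
raw[:len(encoded)] = encoded
struct.pack_into("<HBBiii", raw, 64, len(encoded) + 2, 5, 1, -1, -1, 3)
struct.pack_into("<iI", raw, 116, 66, 128)

entry = parse_dir_entry(bytes(raw))
print(entry.name, entry.entry_type, entry.start, entry.size)
```

All multi-byte fields are read as little-endian, which matches the byte order identifier FEH FFH described for the header.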
3.12.5 Block Allocation Table (BAT)
The BAT (Block Allocation Table) is the main table of space
allocation within MCDFF; it is needed to read any other stream in the
file [HYU08].
The BAT blocks are pointed at by the BAT array contained in the
header. These blocks form a large table of integers, and these integers
are block numbers. The Block Allocation Table holds chains of integers
[MAR07].
The elements in these chains refer to blocks in the file. The
starting block of a file is NOT specified in the BAT; it is specified by
the property of the given file. Each element in the BAT is both a block
number (within the file, not counting the header) and the number of the
next BAT element in the chain. This can be thought of as a linked list of
blocks. The BAT array contains the links from one block to the next,
including the end-of-chain marker [MAR07]. The BAT format structure
is shown in Table (3.6).
Here's an example: Let's assume that the BAT begins as follows:
BAT [0] = 2
BAT [1] = 5
BAT [2] = 3
BAT [3] = 4
BAT [4] = 6
BAT [5] = -1
BAT [6] = 7
BAT [7] = -2
Now, if a file's Property Table entry says that the file begins at
index 0, walking the BAT array shows that the file consists of blocks 0
(because the start block is 0), 2 (because BAT[0] is 2), 3 (BAT[2] is
3), 4 (BAT[3] is 4), 6 (BAT[4] is 6), and 7 (BAT[6] is 7). It ends at
block 7 because BAT[7] is -2, which is the end-of-chain marker.
Similarly, a file beginning at index 1 consists of blocks 1 and 5, and
BAT[5] is -1, which refers to an unused block.
The other special number in a BAT array is -3, which indicates a
"special" block, such as a block used to make up the Small Block Array,
the Property Table, the main BAT, or the SBAT [MAR07].
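The chain walk described above can be sketched in a few lines of Python using the example BAT values. The function name and the loop guard against corrupt chains are assumptions of this sketch.

```python
UNUSED, END_OF_CHAIN, SPECIAL = -1, -2, -3


def walk_chain(bat, start):
    """Follow the BAT from `start`; return the visited blocks and the marker hit."""
    blocks, seen = [], set()
    index = start
    while index >= 0:
        if index in seen or index >= len(bat):
            raise ValueError("corrupt chain")
        seen.add(index)
        blocks.append(index)
        index = bat[index]   # next block in the chain, or a negative marker
    return blocks, index


bat = [2, 5, 3, 4, 6, -1, 7, -2]
print(walk_chain(bat, 0))  # ([0, 2, 3, 4, 6, 7], -2)
print(walk_chain(bat, 1))  # ([1, 5], -1)
```

The second call stops on -1, the unused marker, which is the kind of entry the hiding method looks for.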
Table (3.6) Block Allocation Table Block [MAR07]
| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| BAT_ELEMENT | Any given element in the BAT block | 0x0000, 0x0004, 0x0008, ... 0x01FC | Integer | -1 = unused; -2 = end of chain; -3 = special (e.g., BAT block). All other values point to the next element in the chain and the next index of a block composing the file. |
In the physical structure of an MCDFF file, each block that follows the
header is numbered with an index. Figure (3.10) shows the process of
accessing “Sample A Stream”. The index of the first block of “Sample A
Stream” is stored in its directory entry. The BAT is then consulted to
find the indices of the other blocks that “Sample A Stream” uses. In this
example, the first index in the directory entry is 1, and the BAT shows
that “Sample A Stream” consists of three blocks: the 1st, 4th and 5th
[HYU08].
Figure (3.10) MS Compound files structure [HYU08]
3.12.6 Sector Allocation Table (SAT)
The Sector Allocation Table (SAT) is an array of SecIDs. It
contains the SecID chain of all user streams. The size of the SAT
(number of SecIDs) is equal to the number of existing sectors in the
compound document file [DAN07].
3.13 Office Automation
Office Automation / OLE Automation (later renamed by Microsoft to
just Automation) is an inter-process communication mechanism based on
the Component Object Model (COM). It was intended for use by scripting
languages, originally Visual Basic, but is now used by languages that run
on Windows. It provides an infrastructure whereby applications called
automation controllers can access and manipulate (i.e. set properties of or
call methods on) shared automation objects that are exported by other
applications. The automation controller is the "client" and the
application exporting the automation objects is the "server" [Web10].
3.14 PIA for Microsoft Office 2003
Table (3.7) lists the primary interop assemblies (PIAs) available for use
with Office 2003: the Microsoft Office 2003 applications and component
type libraries that have the same version number and that are signed with
the same key [KHO05].
Table (3.7) Office 2003 applications and component type libraries with the same
version number, signed with the same key [KHO05]
| Office 2003 application or component | PIA name | PIA namespace |
|---|---|---|
| Microsoft Office 11.0 Object Library | Office.dll | Microsoft.Office.Core |
| Microsoft Word 11.0 Object Library | Microsoft.Office.Interop.Word.dll | Microsoft.Office.Interop.Word |
3.15 Word Object Model
Word provides hundreds of objects. These objects are organized in a
hierarchy that closely follows the user interface.
The Word Visual Basic Help file contains a diagram of Word's object
model. The figure is "live": clicking on an object opens the Help topic
for that object. Figure (3.11) shows the portion of the object model
diagram that describes the Document object [GRA01].
The key object in Word is Document, which represents a single, open
document. The Document object has many properties and methods. Many
of its properties are references to collections such as Paragraphs, Tables
and Sections. Each of these collections contains references to objects of
the indicated type, and each object contains information about the
appropriate piece of the document. For example, the Paragraph object has
properties like KeepWithNext and Style, as well as methods like Indent
and Outdent [GRA01].
Figure (3.11) Word Object Model. The Word Visual Basic Help file offers a global
view of Word's structure [GRA01].
3.16 Platform Invoke (PInvoke)
There is a need to call functions located in an unmanaged DLL
from within the .NET Framework. Platform invoke, or PInvoke, is
the technique used to make this happen [Web01].
Figure (3.12) A platform invoke call to an unmanaged DLL function [Web01].
When platform invoke calls an unmanaged function, it performs the
following sequence of actions [Web01]:
I. Locates the DLL containing the function.
II. Loads the DLL into memory.
III. Locates the address of the function in memory and pushes its
arguments onto the stack, marshaling data as required.
Note: Locating and loading the DLL, and locating the address of
the function in memory, occur only on the first call to the function.
IV. Transfers control to the unmanaged function.
3.17 Application Programming Interfaces (API) [Web12]
An API is a set of functions that can be used to work with a
component, application, or operating system. Typically, an API consists of
one or more DLLs that provide some specific functionality.
DLLs are files that contain functions that can be called from any
application running in Microsoft Windows.
3.18 Office Application Programming Interfaces (APIs) [Web05]
Office binary file formats are designed to be accessed through the
Office Application Programming Interfaces (APIs), instead of by direct
manipulation of the file format. Because of the complexity of the formats,
direct manipulation can cause corruption and is strongly discouraged.
The Office 97-2003 binary file formats use the Windows Structured
Storage APIs. The Office-specific information is stored as streams in this
more generalized format. Common elements, such as document properties,
can be accessed through the Structured Storage APIs.
Chapter Four
Proposed Hiding System in Microsoft Compound Document File Format (MCDFF)
4.1 Introduction
The proposed system is one of the text steganography methods. It is
used to embed a steganographic string into a document, namely a
Microsoft Word 2003 document file.
The proposed system embeds the steganographic string in an unused
block of the Microsoft Compound Document Binary File Format
(MCDFF). It consists of two processes for embedding, the cover
generation process and the embedding process, as shown in the block
diagram in Figure (4.1).
Figure (4.1) Block Diagram for Proposed System
[Diagram boxes: Secret message; Microsoft Word Document file (doc.); Encoding Secret Message with Huffman Coding (binary message); Cover Generation Process (document made to look like a collaborative writing effort); Embedding Process (hiding encoded secret message in MCDFF); Sending (stegodocument); Extracting Secret Message from Stegodocument (binary hidden data); Decoding extracted hidden data to find the secret message.]
4.2 Cover Generation Process
The cover generation process disguises data embedding as the
product of a collaborative document authoring effort. That is, the
stegodocument is made to appear to be the work of multiple authors. To
facilitate communication between the authors during collaborative
document authoring, the word processor records the exact
modifications made by an author and embeds them as change tracking
information in the document. From this change tracking information, a
later author can discern the exact changes made by a prior author and
can recover a prior version of the document if necessary (see section 3.4,
Annotation and collaboration tools).
Figure (4.2) Screenshot of Microsoft Word in case of collaborative document authoring
Figure (4.2) shows an example of the collaborative document
authoring process in Microsoft Word, where an author is modifying a
document and the word processor has tracked the author’s
modifications.
Each collaborating author can accept or reject individual or all
modifications made by another author. It is common practice for a
collaborating author to review and then accept or reject each modification
in a document before making his or her own corrections.
Microsoft originally called this feature "Track Changes": "authors" put
"changes" into their documents. More recently, "reviewers" make
"revisions" to documents, and "revisions" are one kind of "markup".
The basic idea of the proposed system is to degenerate the contents of a
cover document D to arrive at another document D', and to embed a
secret message M in D' during the embedding process, as shown in Figure
(4.3). The degeneration introduces errors into the degenerated document
D' such that D' appears to be a preliminary work by a virtual author A',
which is to be revised later by another author.
Figure (4.3) Author A sends a stegodocument S with an embedded message M
to a recipient B after embedding M into a cover document D' to form S, which appears to be the collaborative product of multiple authors A and A'.
A binary secret message M is embedded inside a cover document D' to
obtain a stegodocument S.
Microsoft Word documents, which provide the change tracking facilities
needed to realize the proposed method, have been chosen as cover media.
Communication via Word documents is commonplace for personal,
business, and academic purposes these days, and Word is widely used in
the Middle East. The transmission of such documents will therefore not
be under close scrutiny.
Most of the works cited in the introduction use the technique of
modifying a cover medium to embed information. This type of data
hiding generally assumes that the cover medium used is unknown to an
adversary; otherwise, the discrepancies between the cover medium and
the corresponding stegomedium will arouse suspicion. The proposed
method, on the other hand, provides legitimate cases for using a known
cover document. For example, an already published document that is
collaboratively authored can be used as a cover document. The
stegodocument S then appears to be the version of the paper before the
change tracking information was removed and the paper was submitted
for publication. The transmission of S by one of the collaborating authors
to another author, a colleague, or a supervised student of the author is
reasonable. A colleague or a student receiving the document containing
the change tracking information can learn of the mistakes made by a
colleague and the appropriate corrections to be made.
4.3 Embedding Process
This method of hiding data in MCDFF hides information in unused
space. Unused space occurs as unused blocks, which are used as follows:
(4.1) Algorithm for Hiding Data
Input: Document of Microsoft Compound Document File Format (MCDFF).
Output: Stegodocument
Step1: Open MCDFF file.
Step2: Read Secret Message from user.
Step3: Encode Secret Message with Huffman Coding.
Step4: Search for Unused Block in MCDFF file.
Step5: Insert Secret Message into Unused Block of MCDFF file.
Step6: Save the document file.
Step7: End.
The hiding algorithm can be described as follows:
Step1
Open the document file with track change information (Microsoft Word
Document 2003); see Appendix B.
Step2
Enter the secret message intended for hiding.
Step3
Encode the message with Huffman coding.
Step4
Call the Search Unused Block algorithm to find the unused block
address in the document's binary file format.
Step5
After the unused block address is found, hide the encoded secret
message in that block.
Step6
Save the document file with the hidden data.
Step7
End.
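Steps 4 and 5 can be illustrated on an in-memory toy image. This is a minimal sketch under stated assumptions, not the thesis implementation: the BAT is passed in as a plain list, the payload stands for the already Huffman-encoded message packed into bytes, and the 4-byte length prefix is a hypothetical convention added so that the payload length can be recovered later.

```python
import struct

HEADER_SIZE = 512
UNUSED = -1


def sec_pos(sec_id: int, ssz: int) -> int:
    """Equation (3.1): file offset of a block."""
    return HEADER_SIZE + sec_id * (1 << ssz)


def embed(image: bytearray, bat: list, payload: bytes, ssz: int = 9) -> int:
    """Write payload into the first unused block; return that block's index."""
    block_size = 1 << ssz
    record = struct.pack("<I", len(payload)) + payload  # length prefix: assumption
    if len(record) > block_size:
        raise ValueError("payload does not fit in one block")
    for index, entry in enumerate(bat):
        if entry != UNUSED:
            continue
        pos = sec_pos(index, ssz)
        if pos + block_size > len(image):
            continue  # BAT entry lies beyond the end of the file
        image[pos:pos + len(record)] = record
        return index
    raise LookupError("no unused block found")


# Toy image: header plus four 512-byte blocks; block 2 is marked unused
image = bytearray(HEADER_SIZE + 4 * 512)
bat = [1, -2, -1, -3]
used = embed(image, bat, b"\x2c\x0d")
print(used, sec_pos(used, 9))  # 2 1536
```

The BAT entry is deliberately left at -1, so the block still looks unused to a reader of the file.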
Figure (4.4) Hiding Algorithm Flowchart
[Flowchart boxes: Start; Open MCDFF file; Read Secret Message; Encode Secret Message with Huffman coding; Search for unused Block in MCDFF file; Add Secret Message into Unused Block of MCDFF; Save MCDFF file; End.]
(4.2) Algorithm for Search Unused Block
Input: Document of Microsoft Compound Document Binary File Format (MCDFF).
Output: Unused Block Location.
Step1: Loading Compound Document Header of MCDFF file.
Step2: Extracting information and offset from Header like (Microsoft signature, Block size, Block index of the first block of the property table (first Directory), byte ordering, Block Allocation Table (BAT) ID, minimum size of a stream).
Step3: Go to the First Directory (Root) Address.
Step4: Extract index of first Block in file (starting Block).
Step5: Go to the Block Allocation Table (BAT) Address.
Step6: Loading Block Allocation Table (BAT).
Step7: Accessing from index of the first Block in file to all other Blocks.
Step8: If Block index = -1
  - Calculate the Address of Block index in file.
  - Record the Block as Unused Block.
Step9: Else
  If (Not End of BAT) Go to Step7.
Step10: End.
The Search Unused Block algorithm can be described as follows:
Step1
Load the header of the document file.
Step2
Extract from the header the offsets and the information about the
document file metadata, such as the block size, Root ID, BAT ID, and
minimum size of a stream.
Step3
After the block index of the Root is found, its address in the file is
calculated using equation (3.1),
sec_pos(SecID) = 512 + SecID · sec_size ……. (3.1)
and the search moves to that address.
Step4
Load the Root and extract from it the block index of the first block in
the file.
Step5
After the BAT ID is found, its address is calculated using equation
(3.1), and the search moves to that address.
Step6
Load the BAT.
Step7
Starting from the first block, access all other blocks in the file.
Step8
If the block index = -1, this is an unused block, so calculate its address
using equation (3.1).
Step9
If the block index <> -1, go to Step7 to load another block index and
test it, until the end of the BAT.
Step10
End.
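The search can be sketched directly on the raw bytes of a file. This is a simplified illustration, not the thesis code: it reads a single BAT block whose index is taken from header offset 4CH, and it scans every BAT entry rather than starting from the first block of a particular stream. The function name and the toy image are assumptions.

```python
import struct

HEADER_SIZE = 512
UNUSED = -1


def find_unused_blocks(image: bytes):
    """Return (block index, file offset) for every BAT entry marked unused."""
    ssz, = struct.unpack_from("<H", image, 0x1E)      # sector size exponent
    bat_id, = struct.unpack_from("<i", image, 0x4C)   # index of the BAT block
    block_size = 1 << ssz
    bat_pos = HEADER_SIZE + bat_id * block_size       # equation (3.1)
    count = block_size // 4                           # 4-byte entries per block
    bat = struct.unpack_from(f"<{count}i", image, bat_pos)
    blocks_in_file = (len(image) - HEADER_SIZE) // block_size
    return [(i, HEADER_SIZE + i * block_size)
            for i, entry in enumerate(bat[:blocks_in_file])
            if entry == UNUSED]


# Toy image: header plus four blocks; block 0 holds the BAT itself
image = bytearray(HEADER_SIZE + 4 * 512)
struct.pack_into("<H", image, 0x1E, 9)
struct.pack_into("<i", image, 0x4C, 0)
entries = [-3, 2, -2, -1] + [-1] * 124
struct.pack_into("<128i", image, HEADER_SIZE, *entries)

print(find_unused_blocks(bytes(image)))  # [(3, 2048)]
```

Entries that lie beyond the end of the file are skipped, because there is no block on disk to hold data for them.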
Figure (4.5) Search Unused Block Algorithm Flowchart
[Flowchart boxes: Loading Header of MCDFF file; Extracting information & offset from Header; Go to Root Address; Extract index of first Block in file; Go to BAT Address; Loading BAT; Accessing from first BlockID to other Blocks in file; decision "If BlockID = -1" (Yes/No); Calculate BlockID Address; Record Unused Block Address; End.]
(4.3) Algorithm for Extracting Hidden Data
Input: Stegodocument
Output: Hidden data.
Step1: Open Stegodocument.
Step2: Search for Unused Block in Stegodocument.
Step3: Extract Secret Message from Unused Block of Stegodocument.
Step4: Decode Secret Message.
Step5: End.
The extracting algorithm can be described as follows:
Step1
Open the stegodocument (document file + hidden data).
Step2
Call the Search Unused Block algorithm to find the unused block
location.
Step3
Extract the secret message from the unused block.
Step4
Decode the binary secret message.
Step5
End.
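Steps 3 and 4 can be sketched as follows. Several things here are assumptions made only for illustration: the payload is taken to be stored as ASCII '0'/'1' characters behind a hypothetical 4-byte little-endian length prefix, and the code table contains only the four code words that can be read off the encoding of the word THERE in section 5.2.2.

```python
import struct

HEADER_SIZE = 512


def extract(image: bytes, block_index: int, ssz: int = 9) -> bytes:
    """Read a length-prefixed payload from the given block (sketch only)."""
    block_size = 1 << ssz
    pos = HEADER_SIZE + block_index * block_size     # equation (3.1)
    length, = struct.unpack_from("<I", image, pos)   # hypothetical 4-byte prefix
    if length > block_size - 4:
        raise ValueError("length prefix does not fit in one block")
    return image[pos + 4:pos + 4 + length]


def decode_bits(bits: str, table: dict) -> str:
    """Decode a bit string with a prefix-free code table {symbol: code}."""
    inverse = {code: symbol for symbol, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(out)


# Toy stegodocument: header plus four blocks, payload stored in block 2
bits = "0010" "10000" "000" "0110" "000"
image = bytearray(HEADER_SIZE + 4 * 512)
pos = HEADER_SIZE + 2 * 512
image[pos:pos + 4] = struct.pack("<I", len(bits))
image[pos + 4:pos + 4 + len(bits)] = bits.encode("ascii")

table = {"T": "0010", "H": "10000", "E": "000", "R": "0110"}
payload = extract(bytes(image), 2)
print(decode_bits(payload.decode("ascii"), table))  # THERE
```

Because the code is prefix-free, the decoder can emit a symbol as soon as the accumulated bits match a code word.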
Figure (4.6) Extracting Algorithm Flowchart
[Flowchart boxes: Start; Open Stegodocument; Search for Unused Block; Extract Secret Message from Unused Block; Decode Secret Message; End.]
Chapter Five
Experimental Results and Discussion
5.1 Introduction
In this chapter, the implementation of the proposed system is
explained. The proposed system is built using Microsoft Visual C#
.NET 2003 under Windows XP as the operating system, with
Microsoft Word 2003 and the Office Automation technique provided
by Microsoft.
To hide a secret message in an unused block, the Microsoft Office
Word 2003 Binary File Format Specification is needed. Because file
format developers regard their specification documents as trade secrets
and therefore do not release them to the public, work started with the
Automation technique provided by Microsoft (see section 3.13, Office
Automation). This technique is also used, instead of the Microsoft Office
Word Binary File Format Specification, in the IEEE research paper
published in March 2007, "New Steganographic method for data hiding
in Microsoft Word Documents by a Change Tracking Technique".
To let Word exchange data with other applications, the Automation
technique allows data to be returned, edited, and exported by
referencing another application's objects, properties, and methods.
Accessing Word components from C# isn’t quite as straightforward as
many other features of C# and the .NET Framework; one simply needs
to know what to reference and how to use the components.
Microsoft announced that its two core strategic technologies were the
Win32 API and the Component Object Model (COM). The Win32 API is
supported on all Windows operating systems, including the hybrid
16/32-bit systems (Microsoft Windows 95, Windows 98, and Windows
Millennium Edition) and the 32-bit systems (Windows 2000 and
Windows XP).
When the .NET Framework was first introduced by Microsoft, the
concept of managed code and unmanaged code as two different
programming models was introduced as well. Microsoft defines
managed code as code that is generated for the .NET Framework and
executed by the common language runtime (CLR).
The common language runtime manages memory and validates code to
make sure it doesn't attempt to perform illegal operations such as accessing
memory that doesn't belong to it. The runtime provides access to the
Microsoft .NET Framework and the Base Class Libraries.
Unmanaged code, on the other hand, is any code that does not match the
previous definition. As a result, all code created before the .NET
Framework was released is considered unmanaged code. This includes
the Win32 APIs, valuable external libraries, COM components, and
COM+ services, all of which are useful and important.
The dilemma is that the .NET Framework, the development environment
used in this work, accepts only managed code, whereas the valuable
libraries and components that already exist are all unmanaged code.
There is therefore a need to use this unmanaged code while working
under the .NET Framework environment.
To solve this problem, and for backward compatibility, Microsoft
introduced another concept called "Interoperating with unmanaged
code", which covers how to call or use unmanaged code from within
managed code and vice versa. This process is divided into two
categories:
I. How to call Win32 APIs and DLLs (unmanaged code) from within the
.NET Framework (managed code) using the Platform Invoke
(PInvoke) technique.
II. How to use COM components (unmanaged code) from within the
.NET Framework (managed code) using the COM Interop technique.
Work started with the second category.
To work with the COM objects exposed by the Office 2003 applications,
Microsoft created a set of primary interop assemblies (PIAs). A primary
interop assembly allows managed Visual C# .NET code to communicate
with the host application's COM-based object model. Visual Studio Tools
for the Microsoft Office System uses the PIAs.
The names of the assemblies to reference for Word vary with the
installed version of Word. In this case, they are the PIAs provided with
Visual Studio .NET 2003 and included in the Office 2003 family of
products (see Table (3.7)). Note that with Visual Studio 2005 and
Microsoft Word 2003, the Word reference could not be added.
The following software must therefore be installed:
Microsoft .NET Framework 1.1.
Microsoft Office Word 2003, including the necessary Primary Interop
Assembly.
Visual Studio 2003.
To reference the Word assemblies, follow these steps:
I. On the Project menu, click Add Reference.
II. On the COM tab, locate Microsoft Word Object Library, and then
click Select.
III. Click OK in the Add References dialog box to accept your
selections, as shown in Figure (5.1).
Figure (5.1) Word Reference
After the references are set up, the Word components can be used;
however, these components are a little tricky to deal with and can act in
unexpected ways. These objects work by creating an instance of Word
under the current session and giving access to Word’s functionality.
In order to automate an application, the object model that is employed by
the target application exporting activation objects must be known. This
requires that the developer of the target application publicly document its
object model. The development of automation controllers without
knowledge of the target application's object model is "difficult to
impossible". Because of these complications, Automation components are
usually provided with type libraries, which contain metadata about classes,
interfaces and other features exposed by an object library.
Microsoft has publicly documented the object model of all of the
applications in Microsoft Office, and some other software developers have
also documented the object models of their applications. Object models are
presented to automation controllers as type libraries.
The results of working with the Word object model are listed below (for
full details, see section 3.15, Word Object Model): modifying text format
in a document, counting characters in a document, modifying table
format in a document, hiding text in a document, and many other
processes.
These operations cannot access the binary file format, but they can call
Microsoft Word from Visual C#, so there is no need to build a text editor
for loading the .doc file as is customary, since this is a new method.
To access the binary file format, its specification must be obtained from
its source, Microsoft Corporation in the United States. Many authorities
and many web sites belonging to the Microsoft Developer Network
(MSDN) were contacted to obtain it. Microsoft required a legal
agreement before supplying the binary file format; the agreement was
sent to the company by fax, and the company supplied the Microsoft
Office Word 97-2007 Binary File Format Specification. Surprisingly, the
format turned out to be very complex, and since it was a non-public
format, it was supported by only a few programs. To access it
programmatically, the first category must be used: the PInvoke technique
(see section 3.16, Platform Invoke).
Access to the unused block is shown in Figure (5.2). At this point, it
was not possible to access the compound header of the .doc file without
entering from the root using APIs (see section 3.17, Application
Programming Interfaces, and section 3.18, Office Application
Programming Interfaces). The following functions and interfaces can be
used:
I. Structured Storage API
StgOpenStorageEx function: opens an existing root object in
compound files.
Note: all Windows 2000, Windows XP, and Windows Server 2003
applications should call StgOpenStorageEx instead of StgOpenStorage.
The StgOpenStorage function is used for compatibility with Windows
2000 and earlier applications.
II. COM provides two interfaces to access a compound file: IStorage
and IStream.
The IStorage interface provides methods that can be performed on a storage.
The IStream interface is used to read and write data to a stream.
Figure (5.2) Block Diagram for Unused Block Path in Document File
[Diagram boxes: Structured Storage API; Root; IStorage interface; IStream interface; Header; Offset BAT; Finding BAT; Main Stream; Table Stream; Object pool.]
5.2 System Implementation
In this section, the stages of the system are discussed; these stages
are shown in Figure (5.3).
Figure (5.3) The main menu for the proposed system.
5.2.1 Document before Hiding
After the cover document file is opened, the Track Changes tool is
used to modify the document so that it looks like collaborative writing
between many authors. The cover document is shown in Figure (5.4):
Figure (5.4) Cover Document before Track Change
When the button "Document before Hiding" is clicked, it opens the
window in Figure (5.5):
Figure (5.5) Cover Document after Track Change
5.2.2 Embedding Process
This stage is the primary stage in the proposed system. It describes the
implementation of the embedding method in the following steps:
The first step: is to read the compound header, the FIB has a fixed length
of 1472 byte the first bytes of my Cover file are:
00000000:D0 CF 11 E0 A1 B1 1A E1 00 00 0000000000 00 00000010:00 00 00 00 00 00 00 00 3E 0003 00 FE FF09 00 00000020:06 00 00 00 00 00 0000 00 00 00 00 010000 00 00000030:3F 00 00 00 00 00 00 00 00 10 0000 4100 00 00 00000040:01 00 0000 FE FF FF FF 00 00 00 00 3E 00 00 00 00000000H D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00
1) 8 bytes containing the fixed compound document file identifier (magic number).
00000010H 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00
2) 2 bytes containing the byte order identifier. It should always consist of the byte sequence FEH FFH.
91
Chapter Five Experimental Results and Discussion
00000010H 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00
00000020H 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
3) 2 bytes containing the size of a sector (big block) as a power of two: 09 00 gives 2^9 = 512 bytes. They are followed by 2 bytes containing the size of a short sector (small block): 06 00 gives 2^6 = 64 bytes here.
00000020H 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
4) 4 bytes containing the number of sectors used by the sector allocation table (the number of blocks in the BAT). The BAT uses only one sector (block) here.
00000030H 3F 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00
5) 4 bytes containing the SecID of the first sector used by the directory (the block index of the first block of the property table). It starts at sector (block) 63 here.
00000030H 3F 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00
6) 4 bytes containing the minimum size of standard streams. The bytes 00 10 00 00 are stored in little-endian order, so this size is 00001000H = 4096 bytes here. A file of at least this size is stored in big blocks, and the main BAT is used to walk the big blocks making up the file.
00000040H 01 00 00 00 FE FF FF FF 00 00 00 00 3E 00 00 00
7) 4 bytes containing the block index of the BAT. It starts at block 62 here.
The second step: find the starting block of a file from its property (directory) entry, whose size is 128 bytes:
00008000: 52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00
00008010: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00008020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008040: 16 00 05 01 FF FF FF FF FF FF FF FF 03 00 00 00
00008050: 06 09 02 00 00 00 00 00 C0 00 00 00 00 00 00 46
00008060: 00 00 00 00 00 00 00 00 00 00 00 00 E0 12 2F 4E
00008070: B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00
00008000 52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00
00008010 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00008020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1) 64 bytes containing the character array of the entry name (16-bit characters, terminated by the first <00> character). The name of this entry is "Root Entry" here.
00008070 B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00
2) 4 bytes containing the starting block of the file (42 00 00 00, i.e. block 66, here). This is the first block of the file; the pointer to each following block is read from the BAT.
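The first two steps can be sketched in Python. This is a minimal illustration rather than the thesis implementation: the byte strings are copied from the dumps above, the field offsets are those of the compound document layout that items 1) to 7) walk through, and the names `parse_header`, `sector_offset`, and `parse_dir_entry` are hypothetical.

```python
import struct

# First 80 bytes of the cover file, copied from the header dump.
HEADER = bytes.fromhex(
    "D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00"
    "00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00"
    "06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00"
    "3F 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00"
    "01 00 00 00 FE FF FF FF 00 00 00 00 3E 00 00 00"
)

# The 128-byte "Root Entry" directory entry dumped at offset 00008000.
ROOT_ENTRY = bytes.fromhex(
    "52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00"
    "72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "16 00 05 01 FF FF FF FF FF FF FF FF 03 00 00 00"
    "06 09 02 00 00 00 00 00 C0 00 00 00 00 00 00 46"
    "00 00 00 00 00 00 00 00 00 00 00 00 E0 12 2F 4E"
    "B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00"
)


def parse_header(h):
    """Read the header fields listed in items 1) to 7); all values are little-endian."""
    sector_shift, short_shift = struct.unpack_from("<HH", h, 30)
    return {
        "magic": h[0:8],
        "byte_order": h[28:30],
        "sector_size": 1 << sector_shift,          # 2^9
        "short_sector_size": 1 << short_shift,     # 2^6
        "bat_sector_count": struct.unpack_from("<I", h, 44)[0],
        "dir_start": struct.unpack_from("<i", h, 48)[0],
        "min_stream_size": struct.unpack_from("<I", h, 56)[0],
        "first_bat_block": struct.unpack_from("<i", h, 76)[0],
    }


def sector_offset(block, sector_size=512, header_size=512):
    """File offset of a block: the 512-byte header comes first."""
    return header_size + block * sector_size


def parse_dir_entry(entry):
    """Read the name, starting block, and size from a 128-byte directory entry."""
    (name_bytes,) = struct.unpack_from("<H", entry, 64)  # includes the 2-byte terminator
    name = entry[: name_bytes - 2].decode("utf-16-le")
    start_block, size = struct.unpack_from("<iI", entry, 116)
    return name, start_block, size
```

With the dumped bytes, the directory start block 63 maps to file offset 512 + 63 * 512 = 00008000H, which is where the "Root Entry" dump begins.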
The third step: load the BAT array to access the unused block and hide the secret message in it. An entry of -1 marks an unused block. The block allocation table for this cover is:
Block index: 0  1  2  3  4  5  6  7  8  9   10  11  12  …   …
Next block:  1  2  3  4  5  6  7  8  9  10  11  12  …   -1  …
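Loading the BAT and locating unused blocks can be sketched as follows. This is a hedged illustration only: it assumes the usual compound document markers (-1 for an unused block, -2 for the end of a chain), 512-byte blocks after a 512-byte header, and the names `load_bat`, `unused_blocks`, and `chain` are hypothetical. The small table in the usage note is made up for the example and is not the cover file's BAT.

```python
import struct

UNUSED = -1        # assumed marker for a free block
END_OF_CHAIN = -2  # assumed marker for the last block of a file


def load_bat(data, bat_block, sector_size=512, header_size=512):
    """Read one BAT block as a list of signed 32-bit little-endian entries."""
    offset = header_size + bat_block * sector_size
    count = sector_size // 4
    return list(struct.unpack_from("<%di" % count, data, offset))


def unused_blocks(bat):
    """Indices of blocks the BAT marks as unused."""
    return [i for i, nxt in enumerate(bat) if nxt == UNUSED]


def chain(bat, start):
    """Follow the next-block pointers from a starting block to the end of chain."""
    blocks, seen, cur = [], set(), start
    while cur != END_OF_CHAIN:
        if cur < 0 or cur >= len(bat) or cur in seen:
            raise ValueError("corrupt block chain")
        seen.add(cur)
        blocks.append(cur)
        cur = bat[cur]
    return blocks
```

For a toy table `[1, 2, 3, -2, -1, 6, -2, -1]`, `chain(bat, 0)` walks blocks 0 to 3, and `unused_blocks(bat)` reports blocks 4 and 7 as candidates for hiding.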
The secret message will be:
THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT
TEN PM FROM CELING OUR TARGET DIAMOND
The fourth step: Encoding the secret message with Huffman Coding:
0010 10000 000 0110 000 0011 0110 000 000 10011 000 11101 000 0101 11100 10101 0011 0110 10010 10001 0100 10101 0010 0010 11011 000 0101 0010 11001 0111 0101 10100 0100 10101 0101 0010 000 0110 0011 0010 0010 000 0101 11000 10110 10111 0110 0100 10110 10100 000 0111 10011 0111 0101 11100 0100 10101 0110 0010 0011 0110 11100 000 0010 10010 0111 0011 10110 0100 0101 10010
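For readers unfamiliar with Huffman coding, the encoding step can be sketched in Python. This sketch builds the code table from the message's own character frequencies, which is an assumption: the code words listed above come from the thesis's own table, so the bit strings produced here will differ. The names `build_codes`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter

MESSAGE = ("THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT "
           "TEN PM FROM CELING OUR TARGET DIAMOND")


def build_codes(text):
    """Build a prefix-free code table {symbol: bit string} from symbol frequencies."""
    freq = Counter(text)
    # Heap items are (weight, tie_breaker, partial code table); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code word of every character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits until they match a code word, emit the symbol, and repeat."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(out)
```

Because frequent characters receive shorter code words, the encoded bit string is shorter than eight bits per character, which reduces the space needed in the unused block.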
The fifth step: hide the secret message in the unused block.
When the embedding process button is clicked, it will open the window in
figure (5.6):
Figure (5.6) the Embedding Process Window
After writing the secret message, press the Embed button to hide
this message.
The Exit button closes the current form.
5.2.3 Document after Hiding:
This button opens the document after a secret message has been hidden, as
shown in figure (5.7):
Figure (5.7) Document after Hiding
The message shows that this document contains tracked-change
information and asks whether to continue saving this information.
5.2.4 Extracting Process
This stage describes the implementation of the extracting method in the
following steps:
The first step: read the compound header, as in the embedding process.
The second step: find the starting block of a file.
The third step: load the BAT array to access the unused block and
extract the secret message from it.
The fourth step: decode the secret message with Huffman coding.
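The hide-and-extract round trip on an unused block can be sketched as below. This is an illustration under stated assumptions, not the thesis code: blocks are 512 bytes and follow a 512-byte header, and the 2-byte length prefix written in front of the payload is a hypothetical convention added here so the extractor knows how many bits to read. The function names are also hypothetical.

```python
import struct

SECTOR = 512       # block size read from the header
HEADER_SIZE = 512  # the compound document header occupies the first 512 bytes


def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into bytes, zero-padded to a full byte."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def bytes_to_bits(data, nbits):
    """Unpack bytes into a bit string and drop the padding bits."""
    return "".join(format(b, "08b") for b in data)[:nbits]


def embed(doc, block, bits):
    """Overwrite the start of an unused block with a length prefix and the payload."""
    payload = struct.pack("<H", len(bits)) + bits_to_bytes(bits)  # assumed layout
    if len(payload) > SECTOR:
        raise ValueError("payload does not fit in one block")
    offset = HEADER_SIZE + block * SECTOR
    doc[offset:offset + len(payload)] = payload


def extract(doc, block):
    """Read the length prefix, then recover exactly that many bits."""
    offset = HEADER_SIZE + block * SECTOR
    (nbits,) = struct.unpack_from("<H", doc, offset)
    nbytes = (nbits + 7) // 8
    return bytes_to_bits(doc[offset + 2:offset + 2 + nbytes], nbits)
```

Used on a `bytearray` holding the file, `embed(doc, 2, bits)` followed by `extract(doc, 2)` returns the original bit string, and the file length does not change because only bytes inside the chosen block are overwritten.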
When the Extraction Process button is clicked, it will open the window in
figure (5.8):
Figure (5.8) Extracting Process Window
To extract the hidden data, press the Extract button.
The Exit button closes the current form.
5.3 Comparison between the Proposed System and the Most Popular Text Hiding Methods
This work differs from other text hiding methods in the following respects:
Table (5.1) Comparison between the proposed system and other text hiding methods
1. The proposed system: No difference in the apparent text is found between the cover document and the stegodocument.
   Text hiding systems: A difference in the apparent text between the cover and the stegodocument is found in hiding methods such as inter-line and inter-word spacing.
2. The proposed system: The hidden data is not related to the text cover, which can be English or Arabic text.
   Text hiding systems: Hidden data may be related to the text cover.
3. The proposed system: No problem was detected in the hidden data when mailing or copying the stegodocument.
   Text hiding systems: Some programs, such as mail senders, may inadvertently remove the extra space characters that carry the hidden data.
4. The proposed system: Requires access to the binary file format, which describes exactly how the data is encoded and how to reach the unused block used for hiding data.
   Text hiding systems: Do not need to know the binary file format.
5. The proposed system: Using the Track Change tool does not affect the hidden data.
   Text hiding systems: This tool has not yet been used in related work.
6. The proposed system: Cannot be detected by software that detects changes in character features.
   Text hiding systems: Can be detected by such software.
7. The proposed system: In this work, cover size = 34 KB and hidden size = 63 bytes; note that the size of an empty document is 10/11 KB.
   Text hiding systems: Taking the open space method as an example, inter-sentence spacing requires a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text.
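The data rates in row 7 can be compared with a short calculation. The sketch below only restates the figures quoted in the table (63 hidden bytes, one bit per 160 bytes of text); the function name `inter_sentence_cover_bytes` is hypothetical.

```python
def inter_sentence_cover_bytes(payload_bytes, bytes_per_bit=160):
    """Cover text needed by inter-sentence spacing at one bit per 160 bytes."""
    return payload_bytes * 8 * bytes_per_bit


needed = inter_sentence_cover_bytes(63)
print(needed, "bytes =", needed / 1024, "KB")
```

Under these figures, a 63-byte message would need 80,640 bytes (78.75 KB) of cover text with inter-sentence spacing, compared with the 34 KB cover used in this work.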
Chapter Six
Conclusions and Suggestions for Future Work
6.1 Conclusions
The proposed system provides a new method for embedding text
in text. A number of conclusions were derived from this study:
I. The cover generation process in the hiding system increases
the security of the hiding system and avoids drawing suspicion
that there is hidden data.
II. Hidden data in the document is not affected by copying or
mailing the stegodocument.
III. The proposed system hides English text in another text
and gives good results.
IV. This method of hiding data in MCDFF is only one of
many ways to hide or encrypt data.
V. The difference between the original cover-text size and
the cover-text size after the embedding process is acceptable. For
example, in this case:
Original cover-text size is 34 KB,
Cover-text size after embedding is 34.5 KB,
Hidden message size is 63 bytes,
The size of the empty document is 10 KB/11 KB.
6.2 Suggestions for Future Work
With the rapid development of science and technology, information will be
hidden in increasingly creative ways. Many new methods
and technologies will emerge, and there will be more room for the
development of information hiding technology.
Several suggestions can be made for future work:
I. The system could be modified to work on other
Microsoft Office 2003 files, such as Microsoft PowerPoint (.ppt),
Microsoft Excel (.xls), etc.
II. An encryption process could be applied before the embedding
process; this would increase the security of the system.
III. In MCDFF it is possible to use another digital warren,
such as slack space.
IV. A compression process could be applied after the encoding
process to compress the Huffman-coded output.
V. Secret key steganography could be used instead of pure
steganography to implement the proposed system.
References
[ABD01]
Abdul Wahab, H., B., "Information Hiding in Written Text Using Context
Free Grammar (CFG)", MSc. Thesis, University of Technology,
Department of Computer Science and Information System, Baghdad, 2001.
[ACK07]
Ackley, S., R., "Word File Format", Apache POI – HWPF – Java API to
Handle Microsoft Word Files, 2007.
[ALD05]
Al-Dhao, T., A. and Rahma, S., A., "Analysis of Information Hiding
Techniques in the Text", Engineering & Technology Journal Vol. 24, No.
6, 2005.
[ALS01]
Al-Shamkhy, R., A., "Hiding Text in Text Using Dictionary Method",
MSc. Thesis, Department of Computer Science and Information System,
Baghdad, 2001.
[BAK05]
Baker, E., J., "Image Watermarking Using Coarseness and Wavelet
Transform", Msc. Thesis, University of Technology, Information Institute
for Postgraduate Studies, 2005.
[BER06]
Berghel, H., Hoelzer, D., and Sthultz, M., "Data Hiding Tactics for
Windows and Unix File System",
http://www.berghel.net/publications/data_hiding.php, May 26, 2006.
[BER05]
Bergman C. and Davidson J., "Unitary Embedding for Data Hiding with
the SVD", Security, Steganography, and Watermarking of Multimedia
Contents VII, SPIE Vol. 5681, San Jose, CA, Jan. 2005, URL:
http://orion.math.iastate.edu/cliff/manuscripts/svdstego.pdf
[CAC98]
Cacciaguerra, S., and Ferretti, S., "Data Hiding: Steganography
and Copyright Marking",
http://www.cs.unibo.it/~scacciag/home_files/teach/datahiding.pdf, 1998.
[CHA00]
Chand, M., "Structure Storage: A COM way to read/write persistent
data",
http://www.dotnetheaven.com/Uploadfile/mahesh/_com104252005081250AM/_com1.aspx?ArticleID=307eca4f-723b-4ed5-b823-2a05e71ai402,
June 26, 2000.
[CUM04]
Cummins J., Diskin P., Lau S. and Parlett R., "Steganography And
Digital Watermarking ", School of Computer Science, The University of
Birmingham , 2004.
[DAN07]
Daniel, R., "OpenOffice.org's Documentation of the Microsoft
Compound Document", OpenOffice.org, the Spreadsheet Project, June
2007.
[DIA08]
Dialogika, Makz, Math, Wk and Divo, "How to Retrieve Text from a
Binary .doc File", March 2008.
[DIC07]
Dickman, D., S., "An Overview of Steganography ", July 2007.
[DOB97]
Dr. Dobb's Journal, "Steganography for DOS Programmers", January
1997.
[DUC01]
Ducan, S., "An Introduction to Steganography", Internet Surveys, 2001.
[DUN02]
Dunbar, B., "A detailed look at Steganographic Techniques and their use
in an Open-Systems Environment", SANS Institute, 2002.
[ETT98]
Ettinger, J., M., "Steganalysis and Game Equilibria", Information Hiding:
Second International Workshop, Proceedings, Vol. 1525 of Lecture
Notes in Computer Science, Springer, 1998, pp. 319-328.
[FOL98]
Folk, M., J., Zoellick, B. and Riccardi, G., "File Structures: An Object-
Oriented Approach with C++", Addison-Wesley, 1998.
[GRA01]
Granor, T., E., "Session FT-Automating Microsoft Word", Automating
Microsoft Word Fox Teach 2001, page 38, 2001.
[HYU08]
Hyukdon K., Yeog K. and Sangjin L., "A Tool for Detection of Hidden
Data in Microsoft Compound Document File Format" , 2008
International Conference on Information Science and Security © 2008
IEEE , 2008.
[JAJ98]
Jajodia, S., and Johnson, N., F., "Steganalysis of Images Created Using
Current Steganography Software", Information Hiding: Second
International Workshop, Proceedings, Vol. 1525 of Lecture Notes in
Computer Science, Springer, 1998, pp. 273-289.
[JIT06]
Jithra, K., "Microsoft Office Security, Part one",
http://www.securityfocus.com/infocus/1874, 2006-08-22.
[JOH01]
Johnson N. F., Duric Z. and Jajodia S., "Information Hiding:
Steganography and Watermarking, Attacks and Countermeasures", Kluwer
Academic Publishers, USA, 2001.
[JOH98]
Johnson, F. and Jajodia, S., "Steganalysis: The Investigation of Hidden
Information," in Proc. IEEE Information Technology Conf., Syracuse,
NY, Sep.1998, pp.113-116.
[JOH99]
Johnson, N.F., "Steganography ", an Internet Survey, 1999.
[KAH96]
Kahn D., "The History of Steganography", Information Hiding: First
International Workshop, Proceedings, Vol. 1174 of Lecture Notes in
Computer Science, Springer, 1996, pp. 4-5.
[KAT00]
Katzenbeisser S. and Petitcolas F., "Information Hiding Techniques for
Steganography and Digital Watermarking", Artech House Inc, USA,
2000.
[KHO05]
Khor, S., M. and Leonard, A., "Installing and Using the Office 2003
Primary Interop Assemblies", Microsoft Corporation, January 2005.
[KRE04]
Krenn, R., "Steganography: Implementation & Detection", found online at
http://www.krenn.nl/univ/cry/steg/presentation/2004-01-21-presentation-steganography.pdf, 2004.
[KUO70]
Kuo, F., F., "An Introduction to Error-Correcting Codes", 1970, pp. 225-231.
[LIU07]
Liu T.-Y. and Tsai W.-H., "A New Steganographic Method for Data
Hiding in Microsoft Word Documents by a Change Tracking Technique",
IEEE Transactions on Information Forensics and Security, Vol. 2, No. 1,
March 2007.
[MAR07]
Marc, J., "POIFS File System Internals", the Apache POI Project, the
Apache Software Foundation, 2007.
[MCC99]
McCarty, B., "Learning Debian GNU/Linux", O'Reilly Online
Catalog, Chapter Two, September 1999.
[MIC07]
Microsoft Open Specification Promise, "Microsoft Office Word 97-2007
Binary File Format (.doc) Specification", © 2007 Microsoft Corporation.
[MIC99]
Microsoft Corp., "OLE Concepts and Requirements Overview", October
1999.
[MIK07]
Mikhail, R., M., "Information Hiding Using Petri Nets and Wavelet
Transform", MSc. Thesis, University of Technology, Department of
Computer Science and Information System, Baghdad, 2007.
[MIN06]
Ming, C., Ru, Z., Xinxin, N., and Yixian, Y., "Analysis of Current
Steganography tools: Classifications & Features", International
Conference on Intelligent Information Hiding and Multimedia Signal
Processing, © 2006 IEEE.
[MOR00]
Morkel T., Eloff J., and Olivier M., "An overview of image
Steganography", Information and Computer Security Architecture (ICSA)
Research Group, Department of Computer Science University of Pretoria,
2000, URL:
http://icsa.cs.up.ac.za/issa/2005/Proceedings/Full/098_Article.pdf.
[RAN03]
Randall, B., A., "Visual Studio Tools for the Microsoft Office System",
MCW Technologies, LLC, April 2003.
[RIM97]
Rimell J., "Data Hiding Inside TIFF Images", John's College,
Cambridge, England, 1997.
[ROC08]
Rocha, A. and Goldenstein, S., "Information Hiding: Types and
Applications", IEEE WVU, Anchorgr-2008, 2008.
[ROM96]
Roman, S., "Introduction to Coding and Information Theory", 1996.
[SAL95]
Salomon, D., "Data Compression ", the complete reference, Springer,
PP.38-39, 1995.
[VIL06]
Villan, R., Voloshynovskiy, S., Koval, O.,Vila, J., Topak, E., Deguillaume,
F., Rytsar, Y., and Pun, T., "Text Data-Hiding for Digital and Printed
Documents: Theoretical and Practical Considerations", Computer Vision
and Multimedia Laboratory – University of Geneva, 2006.
[XIU06]
Xiuhui G., Renpu J., and Jiazhen W., "Research on Information Hiding ",
US-China Education Review, ISSN1548-6613, USA, Vol. 5, No. 3 (Serial
No. 18) May 2006.
[YOS06]
Yoshioka, K., Sonoda, K., and Takizawa, O., "Information Hiding on
Lossless Data Compression", International Conference on Intelligent
Information Hiding and Multimedia Signal Processing © 2006 IEEE, 2006.
Websites
[Web01]
"A Closer Look at Platform Invoke",
Website: http://msdn.microsoft.com/en-us/library/0h9e9t7d(vs.71).aspx.
[Web02]
"Bravo (software)",
Website: http://en.wikipedia.org/wiki/Bravo_(software).
[Web03]
"File Format",
Website: http://en.wikipedia.org/wiki/File_format.
[Web04]
"How does Track Changes in Microsoft Word Work?"
Website:
http://www.shaunakelly.com/word/trackchanges/HowTrackChangesWork.html.
[Web05]
"How to extract information from Office files by using Office file formats
and schemas",
Website: http://Support.microsoft.Com/kb/840817/en-us.
[Web06]
"Huffman Coding",
Website:
http://www.si.umich.edu/Classes/540/Readings/Encodings/Encoding%20-%20Huffman%20Coding.htm.
[Web07]
"Microsoft Office",
Website: http://en.wikipedia.org/wiki/Microsoft_Office.
[Web08]
"Microsoft Word",
Website: http://en.wikipedia.org/wiki/Microsoft_Office_Word.
[Web09]
"Microsoft Word 5.0 (PC) Binary File Format",
Website: http://www.msxnet.org/word2rtf/formats/dosword5.
[Web10]
"OLE Automation",
Website: http://en.wikipedia.org/wiki/OLE_Automation.
[Web11]
"Structure of a Word document",
Website:
http://www,linguistics.ucsb.edu/facutly/cumming/WordForLinguists/Structure.htm.
[Web12]
"What Is API",
Website:
http://msdn.microsoft.com/en-us/library/aa141380(office.10).aspx.
Appendix A
Microsoft Word Document Cover before Track Change
Appendix B
Microsoft Word Document Cover after Track Change
Appendix C
Structure of File Information Block (FIB) [MIC07]
In Word version 8, the FIB is reorganized to make future extension easier, and to make it easier to make backward compatible file format changes. The FIB now consists of four substructures: the header and three arrays. The FIB header is unchanged from past versions. The second part is an array of 16-bit "shorts", most of which were present in earlier versions in different locations. The third part is an array of 32-bit longs, many of which were scattered through the previous version FIB. Finally, there is an array of FC/LCB pairs, which were divided into several disjoint arrays in the previous FIB.
Future versions of Word will add entries to the three arrays, so readers of the FIB must be careful to skip over any entries in each array that were not present in the version for which the reader was designed. Writers of the FIB must write exactly as many entries as were defined for the nFib value they put in the FIB.
The FIBFCLCB structure, used in an array in the FIB:
Decimal  Hex     Name    Type   Bitfield size  Comments                                  Introduced
0        0x0000  fc      long   -              File position where data begins.          -
4        0x0004  lcb     ulong  -              Size of data. Ignore fc if lcb is zero.   -
The FCPGDOLD structure, referenced in the FIB, used internally by Word:
Decimal  Hex     Name    Type   Bitfield size  Comments                                  Introduced
0        0x0000  fcPgd   long   -              File position where data begins.          -
4        0x0004  lcbPgd  ulong  -              Size of data. Ignore fc if lcb is zero.   -
8        0x0008  fcBkd   long   -              File position where data begins.          -
12       0x000C  lcbBkd  ulong  -              Size of data. Ignore fc if lcb is zero.   -
The FCPGD structure, referenced in the FIB, used internally by Word. This modified version of the above structure was introduced in Word 2003:
Decimal  Hex     Name    Type   Bitfield size  Comments                                  Introduced
0        0x0000  fcPgd   long   -              File position where data begins.          Word 2003
4        0x0004  lcbPgd  ulong  -              Size of data. Ignore fc if lcb is zero.   Word 2003
8        0x0008  fcBkd   long   -              File position where data begins.          Word 2003
12       0x000C  lcbBkd  ulong  -              Size of data. Ignore fc if lcb is zero.   Word 2003
16       0x0010  fcAfd   FC     -              File position where data begins.          Word 2003
Abstract
Security is a requirement for the individual, society, and the world. Security is not the responsibility or privilege of guards or security agents alone; it is everyone's concern, since keeping the door closed is the responsibility of every person who passes through that door, regardless of height, color, class, or station in life.
Data hiding research has become the heart of data security research because all web sites and network communications rely on video, audio, images, and so on.
Data hiding techniques can conceal secret information in a digital medium without degrading the perceptual quality of that medium, so that other people cannot perceive the existence of secret information in it.
This thesis proposes a steganographic method that exploits the physical characteristics of the computer system, namely how it stores a (.doc) file and handles it as a compound file, so that the unused (empty) block is used to hide data in the compound structure of a Microsoft Word file. The facilities that Microsoft Word provides, such as its tools, are also used to generate the cover.
The proposed system hides one text in another text by steganography, using two processes: the cover generation process and the embedding process.
Cover generation process: since the cover is a Microsoft Word 2003 document, it will appear to be the product of collaborative writing efforts among several authors.
Embedding process: hides a text string in the unused (empty) block of the file's binary structure.
This thesis presents a hiding system for Word, which is one of the Microsoft Office applications. After reviewing the other applications, we found, based on the latest published research, that this application has fewer weaknesses than the others.
The system was implemented in C# 2003 on the Windows XP operating system, on a Pentium 4 laptop with a 2.00 GHz processor and memory in the gigabyte range.
Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Technology
Department of Computer Science
Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique
A thesis submitted to the Department of Computer Science at the University of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Submitted by
Amani Yousif Al-Baghdady
Supervised by
Prof. Dr. Abdul Monem S. Rahma