

Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique

Thesis · May 2009


Abdul Monem S. Rahma

University of Technology, Iraq


All content following this page was uploaded by Abdul Monem S. Rahma on 30 December 2016.

Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Technology
College of Science
Department of Computer Science

Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique

A Thesis Submitted to the Department of Computer Science of the University of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

By
Amani Yousif Al-Baghdady

Supervision By
Prof. Dr. Abdul Moneem S. Rahma

May 26, 2009
Jamada El Thaniah 2, 1430

In the name of God, the Most Gracious, the Most Merciful

"God is the Light of the heavens and the earth. The example of His light is like a niche within which is a lamp; the lamp is within glass, the glass as if it were a brilliant star, lit from a blessed olive tree, neither of the east nor of the west, whose oil would almost glow even if untouched by fire. Light upon light. God guides to His light whom He wills. God presents examples for the people, and God is Knowing of all things."

God, the Most High, the Almighty, has spoken the truth.

Surat An-Nur (The Light), verse 35

Acknowledgment

First of all, my great thanks go to God, who helped me and gave me the ability to perform this work.

My deepest gratitude and appreciation go to my supervisor, Prof. Dr. Abdul Monem S. Rahma, for his helpful comments, his bright ideas, the technical information he provided, his generosity with his knowledge, and for teaching me to overcome the impossible in order to reach my aim.

The guidance, advice, suggestions, kind heart, and encouragement, as well as the fruitful assistance, of my co-supervisor Dr. Hala B. Abdul Wahab were of great help in finishing this thesis.

I would like to express my gratitude to Dr. Hilal H. Saleh, Head of the Computer Science Department of the University of Technology, for offering his encouragement.

I would like to say "thank you" to Dr. Emad K. Jabar for his parental guidance combined with sweet, objective firmness.

Special thanks and appreciation go to Mr. Faiq S. Baji for his advice and support during the period of my study. Furthermore, this work would not have been achieved without the support and friendship of Esrra J. Baker and Huda Abdul Ridah AL-Safar.

I would like to thank all the staff members of the Computer Science Department, especially Ms. Suham Abd in the Department library.

Finally, I would like to thank my family for giving me so much time to improve myself and for helping me to think only of the best.

Dedication

In the name of God, Most Gracious, Most Merciful

To the City of Science and its Teacher ... Prophet "Mohamed"

To my injured Country ... Iraq

To the guardian angel, the pure affection, school of our age and stream of kindness who provides me with love, strength and courage, the person to whom I am still indebted, the dearest person ... my Mother

To the great man who teaches me patience, and inspires me to seek the truth and all the wonderful things I know ... my Father

To those who taught me to depend on myself and to be like them, the guidance without which my steps are aimless in the darkness, the bright candles ... my Brothers:

(Dr. Mahmood, Dr. Ali, Lt. Pilot Anwar & Stu. Ibraheem)

To the one who ignites my enthusiasm whenever its torch fades ... my Uncle

(Assis. Prof. Sulaiman M. Abbas, Head of Electrical Eng. Dep.)

To the true companions, who proved the deep meaning of friendship, who enriched me with courage and love ... my Friends:

(Huda, Dalya, Afrah, Azhar, Nuha, Issra, Zainab, Rabab, Sara, Roa'a)

To the soul of my Aunt Suham

To everyone who helped me even with a word

I hope that I will be well thought of.

The researcher

Miss Hacker

Linguistic Certification

This is to certify that this thesis entitled "Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique" by "Amani Y. Noori" was prepared under my linguistic supervision. Its language was amended to meet the style of the English language.

Linguistic Supervisor

Signature:

Name: K. M. Ahmed Al-Najjar

Date: / / 2009

Supervisor Certificate

I certify that this thesis was prepared under my supervision at the Department of Computer Science, University of Technology, in partial fulfillment of the requirements for the Master's Degree in Computer Science.

Signature:

Name: Prof. Dr. Abdul Monem S. Rahma

Date: / / 2009

Examining Committee Certificate

This is to certify that we have read this thesis entitled "Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique" and, as an examining committee, examined the student "Amani Yousif Noori" in its contents and in what is related to it, and that in our opinion it meets the standard of a thesis for the Degree of Master in Computer Science at the Computer Science Department, University of Technology, with an excellent grade.

Signature:
Name: Dr. Saad K. Majeed
(Chairman)
Date: / / 2009

Signature:
Name: Dr. Murtadha M. Hamad
(Member)
Date: / / 2009

Signature:
Name: Dr. Rehab F. Hassan
(Member)
Date: / / 2009

Signature:
Name: Dr. Abdul Monem S. Rahma
(Supervisor)
Date: / / 2009

Approved by the Computer Science Department, University of Technology

Signature:

Name: Dr. Helal H. Saleh

Date: / / 2009

Head of Computer Science Department

Abstract

Security is a demand of the individual, of society, and of the world; it is not a responsibility or privilege accorded only to guards or security agents.

Information hiding has become a focus of information security research because web sites and network communication depend on multimedia such as audio, video, and images. Information hiding technology can embed secret information into a digital media source without impairing the perceptual quality of that source, so other people cannot perceive the secret information.

This thesis proposes a data hiding method that takes advantage of the physical characteristics of the computer system and of how it stores a document file, treating the file as a compound file. The unused blocks in the Microsoft Compound Document File Format (MCDFF) are used to hide data. Facilities provided by the Microsoft Word processor, such as its Tools, are also used to generate the cover for hiding.

The proposed system embeds the secret text in the structure (binary file format) of a digital and printed text document, namely a Microsoft Word document file (.doc), using two hiding processes: the Cover Generation Process and the Embedding Process.

Cover Generation Process: the cover is a document in the Microsoft Word 2003 file format (.doc) that appears to be the product of a collaborative writing effort between authors.

Embedding Process: a text string is hidden in an unused block of the binary file format of that cover document.


This thesis introduces a system for hiding in Microsoft Word, which is a component of the Microsoft Office System. Among the Microsoft Office applications, Microsoft Word was found to be less vulnerable than the others, according to the most recently published research.

The system is implemented in Visual C# .NET 2003 under Windows XP Service Pack 2, on a Pentium 4 laptop with a 2.00 GHz Mobile Intel processor and 1 GB of RAM.


List of Abbreviations

Acronym    Full Name
ASCII      American Standard Code for Information Interchange
API        Application Programming Interface
APIs       Office Application Programming Interface
BAT        Block Allocation Table
BPCS       Bit Plane Complexity Segmentation
CBF        Chunk Based Format
CFG        Context Free Grammar
CLR        Common Language Runtime
COM        Component Object Model
DBF        Directory Based Format
DCT        Discrete Cosine Transformation
DirID      Directory Identifier
DLL        Dynamic-Link Library
FIB        File Information Block
GIF        Graphics Interchange Format
GUI        Graphical User Interface
HAS        Human Auditory System
HTML       HyperText Markup Language
IEEE       Institute of Electrical and Electronics Engineers
IH         Information Hiding
JPEG       Joint Photographic Experts Group
LSB        Least Significant Bit
Mac        Macintosh
MCDFF      Microsoft Compound Document File Format
MSAT       Master Sector Allocation Table
MSDN       Microsoft Developer Network
MSDOS      Microsoft Disk Operating System
OLE        Object Linking and Embedding
PIA        Primary Interop Assembly
PInvoke    Platform Invoke
POIFS      Poor Obfuscation Implementation File System
RMD        Raw Memory Dumps
RTF        Rich Text Format
SAT        Sector Allocation Table
SBAT       Small Block Allocation Table
SecID      Sector Identifier
TCP/IP     Transmission Control Protocol / Internet Protocol
UTF        Unicode Transformation Format
VBA        Visual Basic for Applications
Win        Windows
WYSIWYG    What You See Is What You Get
XML        Extensible Markup Language

List of Figures

Figure No.  Description
1.1   Information Hiding Hierarchy
1.2   Generic digital watermarking scheme
1.3   Watermarking example
1.4   A data hiding example
2.1   Steganography basic model
2.2   Steganography Types
2.3   Text Hiding methods
2.4   Color quantization
2.5   Halftone quantization
2.6   Huffman Tree for example
2.7   Huffman tree for the 26-letter Alphabet
3.1   Word Versions for Different Operating System
3.2   External Structure of a Word Document
3.3   Track Change Example
3.4   Comments Example
3.5   File Structure Types
3.6   Logic view of file
3.7   Storage and Streams structure
3.8   Sample Word document storage format
3.9   The structure of Hard Disk
3.10  MS Compound files structure
3.11  Word Object Model
3.12  Platform Invokes call to an unmanaged Dll Function
4.1   Block Diagram for Proposed System
4.2   Screenshot of Microsoft Word in case of collaborative document authoring
4.3   Author A sends a stegodocument S to a recipient B
4.4   Hiding Algorithm Flowchart
4.5   Search Unused Block Algorithm Flowchart
4.6   Extracting Algorithm Flowchart
5.1   Word Reference
5.2   Block diagram for Unused Block path in Document file
5.3   The main menu for the proposed system
5.4   Cover Document before Track Change
5.5   Cover Document after Track change
5.6   The Embedding Process Window
5.7   Document after Hiding
5.8   Extracting Process Window

List of Tables

Table No.  Description
2.1   Steganography Attacks
2.2   Probabilities of occurrence in English language
3.1   MCDFF Metadata
3.2   Compound document header structure
3.3   Header (block 1), 512 (0x200) bytes
3.4   Directory entry structure
3.5   Property, 128 (0x80) byte block
3.6   Block Allocation Table
3.7   Office 2003 applications and component type libraries
5.1   Comparisons between the proposed system and other text hiding methods

Glossary

Term: Description

Byte order: The order in which single bytes of a bigger data type are represented or stored.

Compound document: File format used to store several objects in a single file; objects can be organized hierarchically in storages and streams.

Compound document header: Structure in a compound document containing initial settings.

Control stream: Stream in a compound document containing internal control data.

Directory: List of directory entries for all storages and streams in a compound document.

Directory entry: Part of the directory containing relevant data for a storage or a stream.

Directory entry identifier (DirID): Zero-based index of a directory entry.

Directory stream: Sector chain containing the directory.

DirID: Zero-based index of a directory entry.

End Of Chain SecID: Special sector identifier used to indicate the end of a SecID chain.

File offset: Physical position in a file.

Free SecID: Special sector identifier for unused sectors.

Header: Short for "compound document header".

Master sector allocation table (MSAT): SecID chain containing sector identifiers of all sectors used by the sector allocation table.

MSAT SecID: Special sector identifier used to indicate that a sector is part of the master sector allocation table.

Red-black tree: Tree structure used to organise direct members of a storage.

Root storage: Built-in storage that contains all other objects (storages and streams) in a compound document.

Root storage entry: Directory entry representing the root storage.

SecID: Zero-based index of a sector (short for "sector identifier").

SecID chain: An array of sector identifiers (SecIDs) specifying the sectors that are part of a sector chain, and thus enumerating all sectors used by a stream.

Sector: Part of a compound document with fixed size that contains any kind of stream (user stream or control stream) data.
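The glossary terms SecID chain, End Of Chain SecID, and Free SecID can be illustrated with a small Python sketch. It is a toy model only: the allocation table is a hand-written list rather than one parsed from a real .doc file, the function names are hypothetical, and the special values (-1 for a free sector, -2 for end of chain) and the 512-byte header and sector sizes are assumptions taken from the commonly documented compound document layout, not from this thesis.

```python
# Toy model of a sector allocation table (SAT): entry i holds the SecID of
# the sector that follows sector i in its chain, or a special marker.
FREE_SECID = -1           # assumed marker for an unused sector
END_OF_CHAIN_SECID = -2   # assumed marker for the last sector of a chain
HEADER_SIZE = 512         # assumed size of the compound document header


def sector_chain(sat, start):
    """Follow the SecID chain that begins at `start` and return its sectors."""
    chain = []
    secid = start
    while secid != END_OF_CHAIN_SECID:
        if secid < 0 or secid >= len(sat) or secid in chain:
            raise ValueError(f"broken chain at SecID {secid}")
        chain.append(secid)
        secid = sat[secid]
    return chain


def free_sectors(sat):
    """Return the SecIDs that the table marks as unused."""
    return [i for i, nxt in enumerate(sat) if nxt == FREE_SECID]


def sector_offset(secid, sector_size=512):
    """File offset of a sector, assuming sectors directly follow the header."""
    return HEADER_SIZE + secid * sector_size


# Hypothetical table: sectors 0 -> 1 -> 3 form one stream, sector 5 another,
# and sectors 2 and 4 are unused.
TOY_SAT = [1, 3, FREE_SECID, END_OF_CHAIN_SECID, FREE_SECID, END_OF_CHAIN_SECID]

print(sector_chain(TOY_SAT, 0))
print(free_sectors(TOY_SAT))
print(sector_offset(2))
```

In this model the unused sectors are exactly the entries marked with the Free SecID, which is the kind of location the thesis refers to as an unused block.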

Table of Contents

Chapter One: General Introduction and Survey
1.1 Introduction
1.2 Information Hiding History
1.3 Information Hiding Hierarchy
1.4 The Difference between Cryptography, Steganography and Watermarking
1.5 Information Hiding Applications
1.6 Literature Survey
1.7 Aim of Thesis
1.8 Thesis Outlines

Chapter Two: Steganography
2.1 Introduction
2.2 Steganography Basic Model
2.3 Steganography Types
2.3.1 Pure Steganography
2.3.2 Secret Key Steganography
2.3.3 Public Key Steganography
2.4 Steganography Algorithms
2.4.1 Spatial Domain Based Steganography
2.4.2 Transform Domain Based Steganography
2.4.3 Document Based Steganography
2.4.4 File Structure Based Steganography
2.5 Steganography Under Various Media
2.5.1 Hiding in Disk Space
2.5.2 Hiding in Network Packets
2.5.3 Hiding in Software and Circuitry
2.5.4 Hiding in Video
2.5.5 Hiding in Audio
2.5.6 Hiding in Image
2.5.7 Hiding in Text
2.6 Classification of Text Hiding Techniques
2.7 Steganalysis
2.8 Attacks Available to the Steganalyst
2.9 Introduction to the Code
2.10 Why Encode the Data
2.11 Huffman Coding

Chapter Three: Microsoft Word Document File
3.1 Introduction
3.2 History of Word
3.3 Microsoft Word Document and its Components
3.4 Annotation and Collaboration Tools
3.4.1 Track Changes
3.4.2 Comments
3.5 File Format
3.6 Identify the Type of a File
3.6.1 Filename Extension
3.6.2 Magic Number
3.7 File Structure
3.7.1 Raw Memory Dumps/Unstructured Formats (RMD)
3.7.2 Chunk Based Formats (CBF)
3.7.3 Directory Based Formats (DBF)
3.8 Structure Storage
3.9 Microsoft Compound Document File Format (MCDFF)
3.10 Structure of a Word Document File
3.11 Format of the Main Stream
3.12 MCDFF Metadata
3.12.1 Compound Document Header
3.12.2 Byte Order
3.12.3 Sector File Offset
3.12.4 Property Table (Directory)
3.12.5 Block Allocation Table (BAT)
3.12.6 Sector Allocation Table (SAT)
3.13 Office Automation
3.14 PIA for Microsoft Office 2003
3.15 Word Object Model
3.16 Platform Invoke (PInvoke)
3.17 Application Programming Interface (API)
3.18 Office Application Programming Interface (APIs)

Chapter Four: Proposed Hiding System in Document File
4.1 Introduction
4.2 Cover Generation Process
4.3 Embedding Process

Chapter Five: Experimental Results and Discussion
5.1 Introduction
5.2 System Implementation
5.2.1 Document before Hiding
5.2.2 Embedding Process
5.2.3 Document after Hiding
5.2.4 Extracting Process
5.3 Comparisons between the Proposed System and the Most Popular Hiding Methods

Chapter Six: Conclusions and Suggestions for Future Work
6.1 Conclusions
6.2 Suggestions for Future Work

Glossary
References
Appendix A
Appendix B
Appendix C

Chapter One

General Introduction and Survey

1.1 Introduction [XIU06]

With the development of the Internet, information processing technologies, and the rapid development of communication, images, audio, video, and other multimedia information can be rapidly transmitted over a variety of communication networks, which provides greater convenience for compression, storage, and reproduction applications. At the same time, it is convenient to share information resources, and the network has become the main means of communication. Now all kinds of confidential information, including national security information, military information, and personal information (such as credit card numbers), must be transmitted through the network; but the Internet is an open environment, so information security has become increasingly important.

Information security technology has two main branches: cryptography and information hiding. Cryptography is widely used in various industries; encryption technology has been researched for many years, and there are many encryption algorithms. However, encryption clearly signals that a document or other medium has been encrypted, so an attacker can use a variety of tools to attack the secret information. Although encryption techniques have developed rapidly, the attacker's tools have also been strengthened; it is the so-called “instructors always keep one step ahead”. Because of the rapid development of computing power, some limitations have already appeared in the application of encryption technology. This has made people pay more attention to the other main branch of information security, information hiding.

The purpose of traditional encryption technology is to conceal the content, so encrypted documents are difficult to read.

1.2 Information Hiding History

Hiding messages is nothing new; over the years, a multitude of methods have been used to hide information. One of the first documents describing steganography is the Histories of Herodotus. In ancient Greece, text was written on wax-covered tablets. To avoid capture of the message, the sender (Demaratus, in Herodotus's account) scraped the wax off the tablets and wrote a message on the underlying wood. He then covered the tablets with wax again. The tablets appeared to be blank and unused, so they passed inspection by sentries without question [JOH99].

Historically, various steganographic techniques have been used, including:

I. Tattoo. A Roman general shaved the head of a slave and tattooed a message on his scalp. When the slave's hair grew back, the general dispatched the slave to deliver the hidden message to its intended recipient [DIC07].

II. Character marking. Selected letters of printed or typewritten text are overwritten in pencil. The marks are ordinarily not visible unless the paper is held at an angle to bright light [DOB97].

III. Invisible ink. From the 1st century through World War II, invisible inks were often used to conceal hidden messages. A number of substances (milk, vinegar, fruit juices, and urine) can be used for writing. They leave no visible trace until heat or some chemical is applied to the paper.

IV. Pin punctures. Small pin punctures on selected letters are ordinarily not visible unless the paper is held up in front of a light [DOB97].

V. Microfilm. While Paris was under siege in 1870, messages were sent by carrier pigeon. A Parisian photographer used a microfilm technique to enable each pigeon to carry a higher volume of data [DIC07].

VI. Null ciphers (unencrypted messages) were also used. In this method the first letter of each word spells out a message, but such messages are very hard to construct [KAH96].

The following message was actually sent by a German spy during the Second World War [RIM97]:

"Apparently neutral's protest is thoroughly discounted and ignored. Isman hard hit. Blockade issue affects pretext for embargo on by-products, ejecting suets and vegetable oils."

Decoding this message by taking the second letter of each word reveals the following secret message:

"Pershing sails from NY June 1".

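The second-letter decoding described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, the message is the commonly quoted wording (which includes the word "protest"; an assumption wherever it differs from the text above), and a hyphenated word such as "by-products" is treated as a single word.

```python
import string


def second_letters(message: str) -> str:
    """Return the second letter of every whitespace-separated word."""
    letters = []
    for word in message.split():
        # Drop leading and trailing punctuation; inner marks (', -) are kept.
        core = word.strip(string.punctuation)
        if len(core) >= 2:
            letters.append(core[1].lower())
    return "".join(letters)


# Commonly quoted wording of the null-cipher message (assumed here).
MESSAGE = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo on "
    "by-products, ejecting suets and vegetable oils."
)

print(second_letters(MESSAGE))  # pershingsailsfromnyjunei
```

The output has no spaces, and the final letter "i" stands for the numeral 1, so the reader still has to segment it into "Pershing sails from NY June 1".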


1.3 Information Hiding Hierarchy

Information Hiding (IH) is a technique in the area of information security. It secretly embeds information into digital content such as images, audio, movies, and documents so that it cannot be visually or audibly perceived; a data hiding example is shown in Figure (1.4) [YOS06].

Figure (1.1) shows the terminology agreed at the first international workshop on this subject [CAC98]:

Covert channels are defined, in the context of multilevel secure systems (e.g. military computer systems), as communication paths that were neither designed nor intended to transfer information at all. These channels are typically used by untrustworthy programs to leak information to their owner while performing a service for another program [KAT00].

Anonymity is finding ways to hide the meta-content of messages, that is, the sender and the recipients of a message [KAT00].

Information Hiding (IH)
    Covert channels
    Steganography
    Anonymity
    Copyright marking
        Robust copyright marking
            Fingerprinting
            Watermarking
        Fragile watermarking

Figure (1.1) Information hiding hierarchy


5

Steganography, an important subdiscipline of information hiding, is the art and science of communicating in a way which hides the existence of the communication [KAT00].

Fingerprinting is a term that denotes special applications of watermarking. It relates to watermarking applications in which information such as the creator or recipient of digital data is embedded as watermarks [KAT00].

In contrast to steganography, copyright marking guarantees that embedded data can be reliably detected after the image has been modified (but not destroyed beyond recognition) [CAC98].

Watermarking is the process of embedding information into digital multimedia content such that the information (which we call the watermark) can later be extracted or detected for a variety of purposes, including copy prevention and control; an example of watermarking is shown in Figure (1.3) [BAK05].

[Figure: the watermark, the host data, and a secret/public key (K) are the inputs to the marking algorithm, which produces the watermarked data.]

Figure (1.2) Generic digital watermarking scheme [KAT00]

There are several approaches to classifying watermarking systems. One is to categorize them according to the robustness of the watermark against types of attack.


6

Fragile watermarks are watermarks that have only very limited robustness. The embedded watermark will change, or disappear, if the watermarked object is altered. This type of watermark can be used for authentication purposes, to verify the originality of the watermarked object [BAK05].

Robust watermarking is designed to survive "moderate to severe signal processing attacks", in such a way that no signal transform of reasonable strength can remove the watermark. Robust watermarks are applicable in image copyright protection and fingerprinting [BAK05].

Figure (1.3) watermarking example [ROC08]

1.4 The Differences between Cryptography, Steganography and Watermarking

The cryptographer's interest is primarily in obscuring the content of a message, but not the communication of the message. The steganographer, on the other hand, is concerned with hiding the very communication of the message, while the digital watermarker attempts to add sufficient metadata to a message to establish ownership, provenance, source, etc. Cryptography and steganography share the feature that the object of interest is embedded, hidden, or obscured, whereas the object of interest in watermarking is the host or carrier, which is being protected by the object that is embedded, hidden, or obscured. Further, watermarking and steganography may be used with or without cryptography; and imperceptible watermarking shares functionality with steganography, whereas perceptible watermarking does not [BER06].

1.5 Information Hiding Applications [XIU06]

Information hiding technology has been applied in many areas, including e-commerce, electronic transaction protection, confidential communications, copyright protection, copy control, operation tracking, authentication, and signature fields.

Recent research shows that the following applications of information hiding have stimulated research interest:

I. Military organizations and other intelligence agencies need secret communication. On the modern battlefield, where detection of a sensitive signal may lead to a rapid attack, the military often uses communication techniques such as spread-spectrum or atmospheric-scatter transmission to ensure accurate signal transmission.

II. Terrorists are also studying the use of information hiding technology. According to analysis by US anti-terrorist organizations, the terrorists behind the September 11 attacks used steganography, embedding instructions in multimedia (such as images) and transmitting them over the Internet. Without specialized steganalysis tools, it is difficult to detect pictures that have been processed in this way.


8

III. As electronic commerce is springing up, information security is becoming more important. In addition to encryption techniques, people are increasingly concerned with authentication techniques based on hidden messages.

The extensive applications of information hiding technology can be roughly categorized as follows:

Secret communications: hiding the communication process and the communicators.

Copyright protection: an authorization watermark is embedded imperceptibly in the multimedia content.

Testing and certification: digital works can be certified and tested for tampering.

Piracy tracking: used to trace the author or the buyers responsible for copies.

Information identification: some information is hidden in the carrier medium in order to describe some elements of the medium.

Reproduction control and access control: embedded digital watermarks express access control information for the system.

Information control: using information technology to control certain information.

Bills security: ensuring that the watermarks hidden in bills still exist after printing, which guarantees the authenticity of the bills.


[Figure: the message to be hidden, the cover image, and the produced stego image.]

Figure (1.4) A data hiding example [ROC08]

1.6 Literature Survey

The following is a review of related works in this field:

I. Abdul Wahab, H. B., 2001, [ABD01] "Information Hiding in Written Text Using Context Free Grammar (CFG)". This work embedded a text (English text), after constructing it according to a CFG, in another text (English text). The proposed system gives good results and can be applied in several real-life cases where sending an encrypted message would draw suspicion.

II. Al-Shamkhy, R. A., 2001, [ALS01] "Hiding Text in Text Using Dictionary Method". This thesis proposed a system that uses the text medium to embed a secret text file by means of a dictionary. The dictionary contains English words sorted in alphabetical order, from which the user selects words to build the cover message. The receiver does not need this dictionary, which decreases the amount of information needed on the receiver side and increases the security of the proposed system.

III. Al-Saady, B. Y., 2005, "Document Protection Using Digital Watermarking". In this thesis, four methods are suggested to embed a watermark in a document created by Microsoft Word. The two types of watermarking suggested are a visible watermark shown as a background and an invisible watermark that depends on the macro technique. Because a macro program can run with the document, it can be used to control the watermarking operation. Three of the suggested methods use the macro program as a tool to protect both the watermark and the document from unauthorized modification. These methods are powerful in protecting both the watermark and the document when applied to Microsoft Word documents.

IV. Al-Abaichi, A. M., 2005, "Analyzing and Detecting Information Hiding in Computer Printed Text". The proposed system analyzes and detects hidden information in printed text after converting it to a gray-scale image, and consists of two phases: analysis and detection. In the first phase, the boundary of the text image and the baseline from two sides are fixed, along with the beginning and ending of each line, word, and character; the gaps between words and at the ends of lines are determined; and the numbers of lines, words, characters, and gaps between words are calculated. The detection phase deals mainly with four methods used for hiding a secret message in formatted text: line-shift (up, down), open space (inter-word space, end-of-line space, and inter-sentence space), word-shift (horizontal), and feature coding (shortening or lengthening the upward or downward strokes of a character).



V. Eckstein, K. and Jahnke, M., 2005, "Data Hiding in Journaling File Systems": this article structures and compares existing data hiding methods for UNIX file systems in terms of usability and countermeasures. It discusses variant techniques related to advanced file systems and proposes a new technique that stores substantial amounts of data inside journaling file systems in a robust fashion with low detectability, which is demonstrated by means of a proof-of-concept implementation for the ext3 journaling file system.

VI. Liu, T., Y., and Tsai, W., H., 2007, [LIU07] "A New Steganography Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique": this research proposed a hiding method in which text segments in the document are degenerated, mimicking the work of an author with inferior writing skills, with the secret message embedded in the choices of degenerations. The degenerations are then revised with the changes being tracked, making it appear as if a cautious author is correcting the mistakes.

1.7 Aim of Thesis

The aim of this thesis is to use information hiding technology to embed text in the structure (binary file format) of a digital and printed text document, namely a Microsoft Word 2003 document file, using a steganography method.

This can be achieved by the following:

The cover document, which is a Microsoft Word 2003 document, is made to appear to be the product of a collaborative writing effort between many authors, to avoid drawing suspicion that there is hidden data in the document.



1.8 Thesis Outlines

This thesis begins with an introduction to information hiding technique and its hierarchy.

Chapter Two: "Steganography" presents a general description of Steganography, Text hiding methods and Huffman Encoding.

Chapter Three: "Microsoft Word Document File Format" introduces a complete description of Microsoft Word Document, the software and its file format and structure.

Chapter Four: "Proposed Hiding System in Microsoft Compound Document File Format" presents the cover generation process, MCDFF metadata and hiding processes.

Chapter Five: "Experiment Results and Discussion" introduces a complete description of the proposed method implementation and results.

Chapter Six: "Conclusions and Suggestions for Future Work" presents the derived conclusions and some suggested ideas for future work.

Chapter Two

Steganography

2.1 Introduction

The word Steganography comes from two roots in the Greek language: "Stegos", meaning hidden/covered or roof, and "Graphia", which simply means writing [KRE04].

The goal of steganography is to hide a message inside another harmless message in a way that does not allow any enemy to even detect that there is a second, secret message present (to avoid drawing suspicion) [KAT00].

Steganography uses the illusion of normality to mask the existence of covert activity. The illusion is manifested through the use of a myriad of forms including written documents, photographs, paintings, music, sounds, physical items, and even the human body. Two parts of the system are required to accomplish the objective: successful masking of the message, and keeping the key to its location and/or deciphering a secret [DIC07].

2.2 Steganography Basic Model

Figure (2.1) Steganography basic model (blocks: message to hide, cover, stego key, embedding process, stego cover, extracting process, hidden message).

Each data hiding method consists of:

I. Embedding Process.

II. Extracting Process.

The Embedding Process is used to hide a secret message inside a cover (or carrier). The cover carrier and the embedded message create a stego-carrier.

The Extracting Process is used to extract the secret message from a carrier.

Hiding information may require a stegokey or password, that is, additional secret information, so that only those who possess the secret keyword can access the hidden message.

Cover medium + Embedded message + Stegokey = Stego-medium

2.3 Steganography Types

There are basically three types of steganographic protocols, described in the following figure:

Figure (2.2) Steganography types: pure steganography, secret key steganography, and public key steganography.

2.3.1 Pure Steganography [KAT00]

A steganography system which does not require the prior exchange of some secret information (like a stego-key) is called pure steganography. Both sender and receiver must have access to the embedding and extracting algorithm.

Definition: (Pure steganography)

The quadruple σ = < C, M, D, E >, where C is the set of possible covers,

M the set of secret messages with |C| ≥ |M|,

E: C × M → C the embedding function, and

D: C → M the extracting function,

with the property that D(E(c, m)) = m for all m ∈ M and c ∈ C, is called a pure steganography system.

2.3.2 Secret Key Steganography

Secret key steganography is defined as a steganographic system that requires the exchange of a secret key (stego-key) prior to communication. Secret key steganography takes a cover message and embeds the secret message inside it by using a secret key (stego-key). Only the parties who know the secret key can reverse the process and read the secret message. Unlike pure steganography, where a perceived invisible communication channel is present, secret key steganography exchanges a stego-key, which makes it more susceptible to interception. The benefit of secret key steganography is that, even if it is intercepted, only parties who know the secret key can extract the secret message [DUN02].

Definition: (Secret Key Steganography)

The quintuple σ = < C, M, K, D, E >, where C is the set of possible covers,

M the set of secret messages with |C| ≥ |M|, K the set of secret keys,

Ek: C × M × K → C and

Dk: C × K → M

with the property that Dk(Ek(c, m, k), k) = m

for all m ∈ M, c ∈ C and k ∈ K, is called a secret key steganographic system [KAT00].

2.3.3 Public Key Steganography

As in public key cryptography, public key steganography does not rely on the exchange of a secret key. A public key steganography system requires the use of two keys, one private and one public; the public key is stored in a public database. Whereas the public key is used in the embedding process, the private (secret) key is used to reconstruct the secret message [KAT00].

2.4 Steganography Algorithms

Steganography algorithms are classified according to five categories:

(1). Spatial domain based steganography;

(2). Transform domain based steganography;

(3). Document based steganography;

(4). File structure based steganography;

(5). Other categories.

2.4.1 Spatial Domain Based Steganography

Spatial steganography mainly includes LSB (Least Significant Bit) steganography and the BPCS (Bit Plane Complexity Segmentation) algorithm. The spatial methods are most frequently employed by steganography tools because of fine concealment, great capacity for hidden information and easy realization [MIN06].



LSB Replacement & Matching

Least Significant Bit (LSB) steganography replaces the least significant bit in some bytes of the cover file to hide a sequence of bytes which contains the hidden data. LSB steganography includes two schemes: sequential embedding and scattered embedding. Taking images as an example, sequential embedding replaces the pixels' LSBs with the message bits one by one, sequentially. Scattered embedding scatters the message randomly over the whole image, using a random sequence to control the embedding places.

BPCS Steganography

Analogous to the bit-replacing approach of LSB steganography, BPCS steganography hides secret data by way of block-replacing: each bit plane of the image is segmented into pixel-blocks of the same size. The capacity of BPCS can reach 50% of the cover image data. However, large-capacity embedding has a greater influence on the image [MIN06].

2.4.2 Transform Domain Based Steganography [KAT00]

The LSB modification techniques are easy ways to embed information, but they are highly vulnerable to even small cover modifications. An attacker can simply apply signal processing techniques in order to destroy the secret information entirely.

Transform domain methods hide messages in significant areas of the cover image, which makes them more robust to attacks, such as compression, cropping, and some image processing, than the LSB approach. However, while they are more robust to various kinds of signal processing, they remain imperceptible to the human sensory system.

Many transform domain variations exist. One method is to use the discrete cosine transformation (DCT).
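The spatial-domain LSB scheme discussed in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions: the cover is a plain byte string standing in for grayscale pixel values, the function names are hypothetical, and Python's `random.Random(key)` is used as a stand-in for the "random sequence" of scattered embedding (it is not a cryptographically secure generator). With `key=None` the sketch performs sequential embedding; with a key it behaves like a toy secret key system, in the sense that extracting with the same key returns the embedded message.

```python
import random

def to_bits(data: bytes) -> list[int]:
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def from_bits(bits: list[int]) -> bytes:
    """Pack a list of bits (MSB first, length a multiple of 8) into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)

def positions(cover_len: int, n_bits: int, key=None) -> list[int]:
    """Sequential positions when key is None, key-driven scattered positions otherwise."""
    if n_bits > cover_len:
        raise ValueError("message too long for this cover")
    if key is None:
        return list(range(n_bits))
    # Assumption: a seeded PRNG plays the role of the shared stego-key.
    return random.Random(key).sample(range(cover_len), n_bits)

def embed(cover: bytes, message: bytes, key=None) -> bytes:
    """Replace the least significant bit of the selected cover bytes with message bits."""
    bits = to_bits(message)
    stego = bytearray(cover)
    for pos, bit in zip(positions(len(cover), len(bits), key), bits):
        stego[pos] = (stego[pos] & 0xFE) | bit
    return bytes(stego)

def extract(stego: bytes, n_bytes: int, key=None) -> bytes:
    """Read the LSBs back from the same positions used for embedding."""
    pos = positions(len(stego), n_bytes * 8, key)
    return from_bits([stego[p] & 1 for p in pos])

cover = bytes(range(256))            # stand-in for 256 grayscale pixel values
secret = b"Hi"
sequential = embed(cover, secret)
scattered = embed(cover, secret, key=2009)
```

Because only the lowest bit of each selected byte is touched, every cover value changes by at most 1, which is why the change is hard to see; for the same reason, any processing that disturbs the low bits destroys the message, as the text on transform domain methods points out.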



2.4.3 Document based Steganography

This kind of tool embeds data in document files by adding tabs or spaces to .txt or .doc files. One such steganographic tool is software called Snow. Snow embeds data in .txt files by adding tabs and spaces at the end of text lines. Every 3 bits are encoded with 0 to 7 spaces, and the groups of spaces are separated by a tab. So the number of secret bits should be a multiple of 3; otherwise they are padded with 0 bits.

2.4.4 File structure based Steganography

Structural embedding inserts secret data in the redundant bits of the cover file, such as the reserved bits in the file header or the marker segments in the file format. This makes hidden data immune to the visual/aural attack and to statistical detection [MIN06].

2.5 Steganography under Various Media

The onset of computer technology and the internet has given new life to steganography and the creative methods with which it is employed. Computer-based steganographic techniques introduce changes to digital carriers to embed information foreign to the native carriers [JOH01].

Carriers of such messages may resemble innocent sounding text, disks and storage devices, network traffic and protocols, the way software or circuits are arranged, audio, images, video, or any other digitally represented code or transmission [JOH01].

2.5.1 Hiding in Disk space [MIK07]

Another way to hide information relies on finding unused space that is not readily apparent to an observer. Taking advantage of unused or reserved space to hold covert information provides a means of hiding information without perceptually degrading the carrier. The way operating systems store files typically results in unused space that appears to be allocated to files. Another method of hiding information in a file system is to create a hidden partition. These partitions are not seen if the system is started normally. However, in many cases, running a disk configuration utility exposes the hidden partition. These concepts have been expanded in a novel proposal of a steganographic file system. If the user knows the file name and password, then access is granted to the file; otherwise, no evidence of the hidden files exists in the system.

2.5.2 Hiding in Network packets [JOH01]

Various network protocols have characteristics that can be used to hide information. TCP/IP packets are used to transport information; an uncountable number of packets are transmitted daily over the Internet. Any of these packets can provide a covert communication channel. The packet headers have unused space or other values that can be manipulated to hide information. However, filters can be set to detect information in the "unused" or reserved spaces. One way to circumvent this detection is to take advantage of information in the headers that typically goes unchecked by most systems. Such information includes the values for sequence and identification numbers.
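As a rough illustration of the last point, the sketch below splits a message into 16-bit values of the size of the IPv4 identification field, two message bytes per packet. It is an assumption-laden toy: the function names are hypothetical, no packets are built or sent, and a real channel would also have to cope with the fact that unusual identification values may themselves look suspicious.

```python
import struct

def ids_from_message(message: bytes) -> list[int]:
    """Split the message into 16-bit values, one per packet's identification field."""
    if len(message) % 2:
        message += b"\x00"               # pad to a whole number of 16-bit fields
    return [value for (value,) in struct.iter_unpack("!H", message)]

def message_from_ids(ids: list[int], length: int) -> bytes:
    """Rebuild the message from the received identification values."""
    data = b"".join(struct.pack("!H", value) for value in ids)
    return data[:length]                 # drop the padding byte, if any

ids = ids_from_message(b"key")           # [0x6B65, 0x7900]
recovered = message_from_ids(ids, 3)     # b"key"
```

The receiver needs to know the message length (or an agreed terminator) in order to discard padding; here it is passed explicitly.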



2.5.3 Hiding in software and circuitry

Data can also be hidden based on the physical arrangement of a carrier. The arrangement itself may be an embedded signature that is unique to the creator. An example of this is the layout of code distributed in a program or the layout of electronic circuits on a board. This type of "marking" can be used to uniquely identify the design origin and cannot be removed without significant change to the work [JOH01].

2.5.4 Hiding in video

For video, a combination of sound and image techniques can be used. This is due to the fact that video generally has separate inner files for the video (consisting of many images) and the sound, so techniques can be applied in both areas to hide data. Due to the size of video files, the scope for adding lots of data is much greater and therefore the chance of hidden data being detected is quite low [CUM04].

2.5.5 Hiding in Audio

Data hiding in audio signals is especially challenging, because the Human Auditory System (HAS) operates over a wide dynamic range. To put this in perspective, the HAS perceives over a range of power greater than one million to one and a range of frequencies greater than one thousand to one, making it extremely hard to add or remove data from the original data structure. The only weakness in the HAS comes at trying to differentiate sounds (loud sounds drown out quiet sounds), and this is what must be exploited to encode secret messages in audio without being detected [DUN02].



2.5.6 Hiding in Image

Given the proliferation of digital images, especially on the Internet, and given the large amount of redundant bits present in the digital representation of an image, images are the most popular cover objects for steganography [MOR00].

Using image files as hosts for steganographic messages takes advantage of the limited capabilities of the human visual system. Encoding extra data in an image file changes pixels in the image, but these changes would remain imperceptible to the human eye [BER05].

2.5.7 Hiding in Text

Written text can be used as a medium to transmit secret messages. Only small amounts of data can be hidden when hiding data in text; thus, this method is known to have a low data rate.

An important point is that the embedding task in text requires the interaction of the user and therefore cannot be automated, while image and audio methods can embed the data directly and automatically according to their algorithm.

2.6 Classification of Text Hiding Techniques

Steganography methods can try to encode the information directly in the text or in the text format, as shown in figure (2.3).

I. Encoding Information Directly in the Text

Many ways have been proposed to hide information directly in text, like the Syntactic, Semantic, P. Wayner, Chapman, Translation and HTML methods.

Syntactic method: where the structure of sentences is transformed without significantly altering their meaning. This method utilizes punctuation and diction [VIL06].



Example of using punctuation:

The phrases "bread, butter, and milk" and "bread, butter and milk" are both considered correct usage of commas in a list. When the comma appears before the "and", the phrase represents a "1"; the second phrase represents a "0" [ALS01].

Example of using diction and structure of the text:

The sentence "Before the night is over, I will finish" and

The sentence "I will finish before the night is over"

This method is more transparent than the punctuation method. When a verb comes at the beginning of the sentence, this is encoded as a "1"; when an adverbial comes at the beginning of the sentence, this is encoded as a "0" [ALS01]. The expected data rate is only several bits per kilobyte of text. The use of punctuation is noticeable to even a casual reader, and changing the punctuation will impact the clarity and even the meaning of the text, so this can be considered a disadvantage of using punctuation.

Semantics Method

Here words are replaced by their synonyms and/or sentences are transformed via suppression or inclusion of noun phrase coreferences [VIL06].

Example of using Semantic Method

The word "big" could be considered primary and the word "large" secondary. In decoding, primary words are read as ones and secondary words as zeros [ALS01].

However, syntactic and semantic methods are not suitable for all types of documents (e.g. contracts, identity documents, literary texts) and need, in general, human supervision [VIL06].
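The primary/secondary synonym idea can be sketched in a few lines. The synonym table below is hypothetical (only the pair "big"/"large" comes from the text), the function names are illustrative, and the sketch ignores punctuation, capitalization and context, which is exactly where the human supervision mentioned above would be needed.

```python
# Hypothetical synonym table: (primary, secondary). Primary encodes "1", secondary "0".
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
PRIMARY = {p: s for p, s in SYNONYMS}      # primary word  -> its secondary synonym
SECONDARY = {s: p for p, s in SYNONYMS}    # secondary word -> its primary synonym

def embed(cover: str, bits: str) -> str:
    """Rewrite each substitutable word so that it carries the next message bit."""
    remaining = list(bits)
    out = []
    for word in cover.split():
        if remaining and (word in PRIMARY or word in SECONDARY):
            primary = word if word in PRIMARY else SECONDARY[word]
            bit = remaining.pop(0)
            out.append(primary if bit == "1" else PRIMARY[primary])
        else:
            out.append(word)
    if remaining:
        raise ValueError("cover text has too few substitutable words")
    return " ".join(out)

def extract(stego: str, n_bits: int) -> str:
    """Read primary words as ones and secondary words as zeros."""
    bits = ["1" if word in PRIMARY else "0"
            for word in stego.split()
            if word in PRIMARY or word in SECONDARY]
    return "".join(bits[:n_bits])

stego = embed("a big dog made a quick start", "010")
# stego == "a large dog made a quick start"
```

The capacity is one bit per substitutable word, which illustrates why the data rate of such methods is low.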



P.Wayner Method

Peter Wayner proposed a Mimic Function which exploits the statistical profile of a message. Since the stego-objects are created only according to a statistical profile, the semantic components are entirely ignored.

Wayner described one of the most promising techniques: he uses a CFG to create the cover-text and chooses the productions according to the secret message to be transmitted. The secret information is not embedded in the cover; the cover itself is the secret message. If the grammar is unambiguous, the receiver can extract the information by applying standard parsing techniques [KAT00].

Wayner proposed an extension to the technique of mimic functions: given a set of productions, a probability is assigned to each possible production. The sender then constructs a Huffman compression function and converts the secret message to a binary string. The receiver then parses the cover in order to reconstruct the productions which have been used in the embedding step; this can be accomplished by the use of a parse tree for the given CFG [ALS01].

But the vulnerable aspect of this technique is that it is difficult to select meaningful type categories without considering the eventual grammatical requirements of a natural-language style-source [ALD05].

Chapman and Davida Method

Chapman and Davida proposed a system which consists of two functions, NICETEXT and SCRAMBLE. Given a large dictionary of words of different types, and a style source, which describes how words of different types can be used to form a meaningful sentence, NICETEXT transforms secret message bits into sentences by selecting words out of the dictionary which conform to a sentence structure given in the style source [ALS01].

SCRAMBLE reconstructs the secret if the dictionary which has been used is known. Style sources can either be created from natural-language sentences or be generated using a CFG [ALS01].

The most obvious problem with the manual method is that it takes too long to enter large lists. Nicetext focuses on creating large, sophisticated dictionaries with thousands of words [ALD05].

Translation-based steganography

This method uses the expected errors in the translation process, especially in machine translation, to solve the issue of producing implausible text; information is hidden in the noise that occurs in language translation. In cases where sending imperfect translations to a [...] resulting from translation-based steganography are inconspicuous. The translation-based approach, however, may be vulnerable to active attacks [LIU07].

HTML

Information is hidden in HTML files by adding useless spaces and line breaks or by changing the case of letters in the tags [JOH98].

HTML files are good candidates for including extra spaces, because Web browsers ignore these "extra" spaces and they go unnoticed until the source of the page is revealed [KAT00].
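The grammar-based (CFG) generation attributed to Wayner in this section can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the toy grammar, the function names, and the simplification that every choice point has exactly two equally likely productions (one bit per choice), so the Huffman-weighted extension is not modelled. The grammar is chosen so that one token of lookahead identifies the production, which keeps parsing unambiguous.

```python
# Toy grammar (hypothetical). Keys are nonterminals; any other symbol is a terminal word.
GRAMMAR = {
    "S": [["NAME", "VERB", "OBJ"]],
    "NAME": [["Alice"], ["Bob"]],
    "VERB": [["sent"], ["read"]],
    "OBJ": [["the", "THING"], ["a", "THING"]],
    "THING": [["letter"], ["report"]],
}

def expand(symbol, bits, out):
    """Expand symbol leftmost-first, consuming one bit wherever two productions exist."""
    if symbol not in GRAMMAR:
        out.append(symbol)
        return
    options = GRAMMAR[symbol]
    choice = 0
    if len(options) == 2:
        choice = bits.pop(0) if bits else 0     # pad with 0 when the message runs out
    for part in options[choice]:
        expand(part, bits, out)

def embed(bits: str) -> str:
    """Generate cover sentences whose production choices spell out the bits."""
    queue = [int(b) for b in bits]
    sentences = []
    while True:
        words = []
        expand("S", queue, words)
        sentences.append(" ".join(words))
        if not queue:
            break
    return ". ".join(sentences) + "."

def parse(symbol, tokens, bits):
    """Consume tokens matching symbol and record which production was used."""
    if symbol not in GRAMMAR:
        if not tokens or tokens.pop(0) != symbol:
            raise ValueError("text was not produced by this grammar")
        return
    options = GRAMMAR[symbol]
    choice = 0
    if len(options) == 2:
        if not tokens:
            raise ValueError("text ended unexpectedly")
        # The two alternatives start with different terminals in this toy grammar.
        choice = [option[0] for option in options].index(tokens[0])
        bits.append(choice)
    for part in options[choice]:
        parse(part, tokens, bits)

def extract(text: str, n_bits: int) -> str:
    tokens = text.replace(".", " ").split()
    bits = []
    while tokens:
        parse("S", tokens, bits)
    return "".join(str(b) for b in bits[:n_bits])

cover_text = embed("1001")       # "Bob sent the report."
```

Each sentence of this grammar carries four bits; a realistic grammar would need many more productions to produce text that does not repeat itself conspicuously.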



Figure (2.3) Text hiding methods. Text hiding techniques divide into (I) encoding information directly in the text (syntactic, semantic, P. Wayner, Chapman and Davida, translation-based, and HTML methods) and (II) encoding information in the text format (open-space, line-shift, word-shift and feature encoding, color quantization, and halftone quantization); each method produces a binary code.


II. Encoding Information in the Text Format [ALS01].

Information can be embedded in the format rather than in the message itself. Secret information can be stored in the size of inter-line or inter-word spaces. If the space between two lines is smaller than some threshold, a "0" is encoded; otherwise a "1" is encoded. Alternatively, infrequent additional white space characters are introduced to form the secret message.

Open Space method

Encode through manipulation of white space (unused space) on the

printed page. There are three methods for using white space to encode

data.

Inter-Sentence Spacing [ALS01].

This method deals with encoding a binary message into a text by

placing one or two spaces after the sentence, such that one space

represents "0" and two spaces represent "1".

The disadvantage of this method is that it is inefficient, requiring a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text. Its ability to encode also depends on the structure of the text, and many word processors automatically set the number of spaces after periods to one or two characters.
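A minimal sketch of inter-sentence spacing follows. The function names are hypothetical, and the sentence splitter is a simplifying assumption: it treats every ".", "!" or "?" followed by white space as a sentence end, so abbreviations such as "e.g." would be miscounted.

```python
import re

def embed(cover: str, bits: str) -> str:
    """Put one space after a sentence for "0" and two spaces for "1"."""
    sentences = re.split(r"(?<=[.!?])\s+", cover.strip())
    if len(bits) > len(sentences) - 1:
        raise ValueError("need one sentence gap per bit")
    out = sentences[0]
    for i, sentence in enumerate(sentences[1:]):
        gap = "  " if i < len(bits) and bits[i] == "1" else " "
        out += gap + sentence
    return out

def extract(stego: str, n_bits: int) -> str:
    """Read the gaps after sentence-ending punctuation back as bits."""
    gaps = re.findall(r"(?<=[.!?])( +)(?=\S)", stego)
    return "".join("1" if len(gap) == 2 else "0" for gap in gaps[:n_bits])

stego = embed("One. Two. Three. Four.", "10")
# stego == "One.  Two. Three. Four."
```

At the quoted rate of one bit per 160 bytes, a cover text of 16,000 bytes carries about 100 bits, roughly 12 bytes of secret data.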

A. End-of-line spaces [ALS01].

This method deals with inserting spaces at the end of lines. The data are encoded by allowing for a predetermined number of spaces at the end of each line. The advantages of this method are that it goes unnoticed by readers and that the amount of hidden information is larger than with the inter-sentence method. The disadvantage is that some programs, like "sendmail", may inadvertently remove the extra space characters.
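The end-of-line method can be sketched as follows. The choice of three bits per line (0 to 7 trailing spaces) is an assumption borrowed from the description of Snow in Section 2.4.3, not something this subsection prescribes, and the function names are hypothetical.

```python
BITS_PER_LINE = 3   # assumption: each line carries 3 bits as 0..7 trailing spaces

def embed(cover: str, bits: str) -> str:
    """Append 0..7 spaces to each line, one 3-bit group per line."""
    lines = [line.rstrip(" ") for line in cover.split("\n")]
    bits += "0" * (-len(bits) % BITS_PER_LINE)       # pad to a multiple of 3 bits
    groups = [bits[i:i + BITS_PER_LINE] for i in range(0, len(bits), BITS_PER_LINE)]
    if len(groups) > len(lines):
        raise ValueError("cover has too few lines")
    for i, group in enumerate(groups):
        lines[i] += " " * int(group, 2)
    return "\n".join(lines)

def extract(stego: str, n_bits: int) -> str:
    """Count trailing spaces on each line and turn the counts back into bits."""
    bits = ""
    for line in stego.split("\n"):
        trailing = len(line) - len(line.rstrip(" "))
        bits += format(trailing, "03b")
    return bits[:n_bits]

stego = embed("first line\nsecond line\nthird line", "10110")
# A program that strips trailing white space erases the message:
stripped = "\n".join(line.rstrip() for line in stego.split("\n"))
```

Extracting from `stego` returns the embedded bits, while extracting from `stripped` returns only zeros, which is the weakness noted for mail programs.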

B. Inter-Word-Spaces [ALS01].

Using white space to encode data in this way involves right justification of text. One space between words is interpreted as a "0"; two spaces between words are interpreted as a "1". The advantages of this method are that changing the number of spaces has little chance of changing the meaning of a phrase or sentence, and that the casual reader is unlikely to take notice of slight modifications in white space. The disadvantage is that even if the reader does not notice the manipulation, a word processor may inadvertently change the number of spaces, destroying the hidden data.

Line-Shift Coding

In this method, text lines are vertically shifted (moved up or down)

according to the secret message bits, whereas other lines are kept

stationary for the purpose of synchronization. If a line is moved up, a

"1" is encoded; otherwise a "0" is encoded [DUC01].

The disadvantage of this method is that it is the text coding technique most visible to the reader; it encodes only a few bits even in large documents (one bit per line); and the need for the original document may decrease the security of the system [ALS01].
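Line-shift coding can be sketched on a list of baseline positions. The spacing and shift values, the function names, and the choice to move a line down for a "0" are all assumptions for illustration. In this sketch every second line is a carrier and its neighbours stay stationary, so the decoder can compare the gap above a carrier with the gap below it instead of consulting the original document.

```python
LINE_SPACING = 12.0   # assumed nominal distance between baselines, in points
SHIFT = 0.5           # assumed size of the vertical shift, in points

def embed(n_lines: int, bits: str) -> list[float]:
    """Return baseline y-positions; every second line carries one bit."""
    carriers = range(1, n_lines - 1, 2)      # each carrier has stationary neighbours
    if len(bits) > len(carriers):
        raise ValueError("not enough lines for this message")
    baselines = [i * LINE_SPACING for i in range(n_lines)]
    for line, bit in zip(carriers, bits):
        # y grows downward, so moving a line up decreases its baseline position
        baselines[line] += -SHIFT if bit == "1" else SHIFT
    return baselines

def extract(baselines: list[float], n_bits: int) -> str:
    """A carrier that sits closer to the line above it was moved up and encodes "1"."""
    bits = ""
    for line in range(1, len(baselines) - 1, 2):
        above = baselines[line] - baselines[line - 1]
        below = baselines[line + 1] - baselines[line]
        bits += "1" if above < below else "0"
    return bits[:n_bits]

layout = embed(7, "101")
# layout == [0.0, 11.5, 24.0, 36.5, 48.0, 59.5, 72.0]
```

A seven-line page carries only three bits here, which matches the low capacity noted above.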

Word-shift Coding [ALS01]

In this method, codewords are coded into a document by shifting the horizontal or vertical locations of words within text lines, while maintaining a natural spacing appearance.

This method is only applicable to documents with variable spacing between adjacent words. As a result of this variable spacing, it is necessary to have the original image, or at least to know the spacing between words in the unencoded document.

A. Encode Codeword (Horizontal Shift- Word)

For each text line, the largest and the smallest spaces between words

are found. It is possible to alter every space between two words

[ALS01].

For example take the Sentence1:

We explore new steganographic and cryptographic algorithms and

techniques throughout the world to produce wide variety and security

in the electronic web called the Internet

Applying some horizontal shifting word algorithm to obtain the

following sentence

Sentence 2:



in the electronic web called the Internet.

By overlapping the two sentences, obtain the following:


techniques throughout the world to produce wide variety and

security in the electronic web called the Internet.

This is achieved by expanding the space before "wide" and "web" by one point and condensing the space after "explore" and "the world" by one point in Sentence 1. The sentence containing the shifted words appears harmless, but combining it with the original sentence produces a different message: explore the world wide web.

The same method can encode a binary message instead of a codeword. For example, expanding the space before "explore", "the world", "wide" or "web" by one point is encoded as "1", and condensing the space after "explore", "the world", "wide" or "web" by one point is encoded as "0".

By applying random horizontal shifts to all words in the document, an attacker could eliminate the encoding.

B. Encode Codeword (Vertical Shift- Word)

Shifting the vertical locations of words can be used to help identify

an original document. A similar method can be applied to display an

entirely different message [ALS01].

For example take the following sentence:




Applying some vertical shifting word algorithm to obtain the following

sentence:




The same method can encode a binary message instead of a codeword. For example, shifting up the words "explore" and "the world" is encoded as "1", and shifting down the words "wide" and "web" is encoded as "0".

Feature Encoding

Here features such as shape, size, or position are manipulated. In this method certain text features are altered, or not altered, depending on the codeword. For example, one could encode bits into text by extending or shortening the upward, vertical end lines of letters such as b, d, h, etc. Generally, before encoding, feature randomization takes place: character end line lengths are randomly lengthened or shortened, then altered again to encode the specific data. This removes the possibility of visual decoding, as the original end line lengths would not be known; to decode, one requires the original image.

Examples of using feature coding

A long d can be decoded as "1"; a short d can be decoded as "0".

A long h can be decoded as "1"; a short h can be decoded as "0".

A long b can be decoded as "1"; a short b can be decoded as "0".

This method has a number of advantages, such as a high amount of encoded data and being largely indiscernible to the reader; the disadvantage is that feature coding can be defeated by adjusting each end line length to a fixed value [ALS01].

Color quantization [VIL06]

The main idea of this method is to quantize the color or luminance intensity of each character in such a manner that the human visual system is not able to distinguish between the original and quantized characters, while a specialized reader machine can do so easily. An example illustrating this method is shown in Figure (2.4). Therein, dark characters encode a 0, whereas light ones encode a 1. A binary sequence can be sequentially embedded into the cover text. Notice that the embedding rate is comparatively higher than the rate of inter-line or inter-word space modulation methods.

VAMOS A TRABAJAR

(a)

VAMOS A TRABAJAR

0 1 0 1 1 0 0 1 0 0 0 1 0 1

(b)

Figure (2.4) .Color quantization: (a) original text; (b) marked text (exaggerated)



Halftone Quantization [VIL06]

This method relies on half toning, a widely used printing technology

that enables continuous tone images to be printed with one color ink

(grayscale) or a few color inks (color). Here, the discussion is restricted to

black & white printers.

In order to simulate a given gray shade a halftone printer uses a

halftone screen. This method exploits the fact that there exist several

possible choices for the halftone screen leading to the same gray shade.

Therefore, one can use this property in order to hide data in each text character by using a different halftone screen according to the message m that one wishes to embed. The major strength of this method is that all characters in the stego text will have the same gray shade. This method is intended mainly for printed documents.

(a) (b) (c)

Figure (2.5) Halftone quantization: (a) Original character; (b) marked character for m = 0;

(c) Marked character for m = 1.

2.7 Steganalysis

A goal of steganography is to avoid drawing suspicion to the transmission of a hidden message. If suspicion is raised, this goal is defeated. Steganalysis is the art of discovering and rendering useless such covert messages [JOH01].

In other words, steganalysis attempts to detect the existence of hidden information [ALS01].

The steganalyst is one who applies steganalysis in an attempt to detect the existence of hidden information and/or render it useless. Two aspects of steganalysis are the detection and the distortion of embedded messages. Detection requires that the analyst observes various relationships between combinations of cover, message, stego-media, and steganography tool. Distortion attacks require that the analyst manipulates the stego-media to render the embedded information useless or remove it altogether [ETT98].

2.8 Attacks Available to the Steganalyst

There are many possible situations which confront the steganalyst, depending on what information is available. The different cases are shown in Table (2.1) [JAJ98]:

Table (2.1) Steganography Attacks

1-Stego-only attack: only the stego-object is available for analysis.

2-Known cover attack: the "original" cover-object and the stego-object are both available.

3-Known message attack: at some point, the attacker may know the hidden message. Analyzing the stego-object for patterns that correspond to the hidden message may be beneficial for future attacks against that system. Even with the message, this may be very difficult and may even be considered equivalent to the stego-only attack.

4-Chosen stego attack: the steganography tool (algorithm) and the stego-object are known.

5-Chosen message attack: the steganalyst generates a stego-object from some steganography tool or algorithm using a chosen message. The goal in this attack is to determine corresponding patterns in the stego-object that may point to the use of specific steganography tools or algorithms.

6-Known stego attack: the steganography algorithm (tool) is known and both the original and stego-objects are available.

2.9 Introduction to the Code [ABD01]

A code is nothing more than a set of strings over a certain alphabet. For example, the set C = {0, 10, 110, 1110} is a code over the alphabet {0, 1}. Of course, codes are generally used to encode messages. For instance, one may use the set C to encode the first four letters of the alphabet, as follows:

a → 0

b → 10

c → 110

d → 1110

One can then encode words (or messages) built up from these letters. The word "cab", for instance, is encoded as

cab → 110010
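The code C from this section can be checked with a few lines of Python. The function names are illustrative; the decoder relies on the fact that no codeword of C is a prefix of another, so a codeword can be recognised as soon as its last bit arrives.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "1110"}

def encode(word: str) -> str:
    """Concatenate the codewords of the letters in word."""
    return "".join(CODE[letter] for letter in word)

def decode(bits: str) -> str:
    """Read bits left to right, emitting a letter whenever a codeword is complete."""
    inverse = {codeword: letter for letter, codeword in CODE.items()}
    letters, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            letters.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(letters)

encoded = encode("cab")      # "110010"
decoded = decode(encoded)    # "cab"
```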

2.10 Why Encode the Data [KUO70]

There are three reasons to encode data that is about to be transmitted

(through space, for instance) or stored (on computer disk, for instance).

The first reason is for efficiency. It clearly makes sense to compress data

as much as possible in order to save transmission time or storage space. In

fact, data compression is very big business in the computer world. The

second reason to encode data is for error detection and /or correction.

The third reason is for secrecy, so that unauthorized persons cannot read

the data.

In other words, the goals of encoding are for efficiency, error correction,

and secrecy.

2.11 Huffman Coding

There are different ways of encoding data, and one of these ways is Huffman coding [Web06].

In 1952, D. A. Huffman published a method for constructing highly efficient instantaneous encoding schemes. This method is now known as Huffman encoding [ROM96].

The idea behind Huffman coding is simply to use shorter bit patterns for more common characters, and longer bit patterns for less common characters [Web06].

The method starts by building a list of all the alphabet symbols in descending order of their probabilities. It then constructs a tree with a symbol at every leaf, from the bottom up. This is done in steps where, at each step, the two symbols with the smallest probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced with an auxiliary symbol representing both of them. When the list is reduced to just one auxiliary symbol (representing the entire alphabet) the tree is complete [SAL95].

An Example [Web06]

To encode the letters A (0.12), E (0.42), I (0.09), O (0.30), and U (0.07), listed with their respective probabilities, go through the following steps:

1. Consider each of the letters as a symbol with its respective probability.

2. Find the two symbols with the smallest probabilities and combine them into a new symbol containing both letters, adding their probabilities.
(Note 1: There may be a choice between two symbols with the same probability. In this case either symbol can be chosen; the final tree and codes will be different, but the overall efficiency of the code will be the same.)
(Note 2: Frequency counts or other values may be used instead of probabilities.)

3. Repeat step 2 until there is only one symbol left, with a probability of 1.

4. To see the code, redraw all the symbols in the form of a tree, where each symbol either contains a single letter or splits up into two smaller symbols. Label all the left branches of the tree with a 0 and all the right branches with a 1. The code for each letter is the sequence of 0's and 1's that leads to it on the tree, starting from the symbol with a probability of 1.

Figure (2.6) Huffman Tree for example

5. Thus the codes for each letter are:

A = 100, E = 0, I = 1011, O = 11, U = 1010.
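The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function name huffman_codes and the use of a heap are assumptions. The sketch also assumes that, of the two symbols combined at each step, the one with the smaller probability becomes the left (0) branch; with that convention it reproduces the codes listed in step 5.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Build Huffman codes by repeatedly merging the two least probable symbols."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_left, _, left = heapq.heappop(heap)    # smallest probability: left branch
        p_right, _, right = heapq.heappop(heap)  # next smallest: right branch
        # Merging pushes both subtrees one level down, so prefix one more bit.
        merged = {symbol: "0" + code for symbol, code in left.items()}
        merged.update({symbol: "1" + code for symbol, code in right.items()})
        heapq.heappush(heap, (p_left + p_right, next(tiebreak), merged))
    return heap[0][2]


codes = huffman_codes({"A": 0.12, "E": 0.42, "I": 0.09, "O": 0.30, "U": 0.07})
for letter in sorted(codes):
    print(letter, codes[letter])
```

With a different tie-breaking or branch-labeling rule the individual codes may differ, as Note 1 points out, while the code lengths stay equally efficient.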

The Huffman code for the 26-letter alphabet [ROM96]

Figure (2.7) Huffman tree for the 26-letter Alphabet (tree diagram not reproduced; the letters, probabilities, and codes it shows are listed in Table (2.2))


Table (2.2) shows the letters of the alphabet with their approximate probabilities of occurrence in English, based on statistical data. The last column of the table shows the Huffman code of each letter. The encoding scheme of Table (2.2) is the one used in this work [ROM96].

Table (2.2) Probabilities of Occurrence in English Text

Symbol   Probability   Huffman code
E        0.1300        000
T        0.0900        0010
A        0.0800        0011
O        0.0800        0100
N        0.0700        0101
R        0.0650        0110
I        0.0650        0111
H        0.0600        10000
S        0.0600        10001
D        0.0400        10010
L        0.0350        10011
C        0.0300        10100
U        0.0300        10101
M        0.0300        10110
F        0.0200        10111
P        0.0200        11000
Y        0.0200        11001
B        0.0150        11010
W        0.0150        11011
G        0.0150        11100
V        0.0100        11101
J        0.0050        111100
K        0.0050        111101
X        0.0050        111110
Q        0.0025        1111110
Z        0.0025        1111111
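As an illustration of how Table (2.2) can be applied, the following Python sketch encodes and decodes a word with the table and computes the average code length. The names TABLE_2_2, encode, decode, and average_code_length are assumptions made for this sketch; only the table values come from the text. Characters outside the 26 letters (spaces, digits) are not handled and would raise a KeyError.

```python
# Table (2.2) copied from the text: letter, probability, Huffman code.
TABLE_2_2 = """
E 0.1300 000
T 0.0900 0010
A 0.0800 0011
O 0.0800 0100
N 0.0700 0101
R 0.0650 0110
I 0.0650 0111
H 0.0600 10000
S 0.0600 10001
D 0.0400 10010
L 0.0350 10011
C 0.0300 10100
U 0.0300 10101
M 0.0300 10110
F 0.0200 10111
P 0.0200 11000
Y 0.0200 11001
B 0.0150 11010
W 0.0150 11011
G 0.0150 11100
V 0.0100 11101
J 0.0050 111100
K 0.0050 111101
X 0.0050 111110
Q 0.0025 1111110
Z 0.0025 1111111
"""

PROBABILITY = {}
CODE = {}
for row in TABLE_2_2.strip().splitlines():
    letter, probability, code = row.split()
    PROBABILITY[letter] = float(probability)
    CODE[letter] = code


def encode(text):
    """Replace every letter by its Huffman code (letters only, case-insensitive)."""
    return "".join(CODE[ch] for ch in text.upper())


def decode(bits):
    """Emit a letter as soon as the bits read so far match a code."""
    reverse = {code: letter for letter, code in CODE.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            letters.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code")
    return "".join(letters)


def average_code_length():
    """Expected number of bits per letter under the table's probabilities."""
    return sum(PROBABILITY[letter] * len(CODE[letter]) for letter in CODE)


print(encode("the"))
print(decode(encode("the")))
print(round(average_code_length(), 3))
```

Under the probabilities of the table, the average comes to about 4.3 bits per letter, compared with the 5 bits a fixed-length code for 26 letters would need.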

Chapter Three

Microsoft Word Document File Format

3.1 Introduction

Microsoft Word is a word processing program. Many versions of Word were written for several platforms¹, including the IBM PC running DOS, the Apple Macintosh, and Microsoft Windows, as shown in Figure (3.1).

It is a component of the Microsoft Office system; Microsoft began calling it Microsoft Office Word instead of merely Microsoft Word.

[Chart: number of Word versions (vertical axis, "Word Versions Number") against year of issue (horizontal axis, "Years of Issuing", 1983 to 2008) for MS-DOS, Macintosh, and Windows.]

Figure (3.1) Word Versions for Different Operating Systems

1 Platform: the underlying hardware or software for a system.


3.2 History of Word

Many concepts and ideas of Word were brought from Bravo, the original GUI word processor developed at Xerox PARC¹ [Web08].

Bravo's creator, Charles Simonyi, left PARC to work for Microsoft in 1981. Simonyi hired Richard Brodie, who had worked with him on Bravo, away from PARC that summer [Web02].

Word featured the concept of "What You See Is What You Get", or WYSIWYG, and was the first application with such features as the ability to display bold and italic text on an IBM PC. Word made full use of the mouse, which was so unusual at the time that Microsoft offered a bundled Word-with-Mouse package [Web08].

Although MS-DOS was a character-based system, Microsoft Word was the first word processor for the IBM PC that showed actual line breaks and typeface markups such as bold and italics directly on the screen while editing. This was not a true WYSIWYG system, however, because the available displays did not have the resolution to show actual typefaces [Web02].

Word 97

Word 97 had the same general operating performance as later versions such as Word 2000. This was the first version of Word to feature the "Office Assistant"², an animated helper used in all Office programs [Web08].

Word 2000

For most users, one of the most obvious changes introduced with Word 2000 (and the rest of the Office 2000 suite) was a clipboard³ that could hold multiple objects at once. Another noticeable change was that the Office Assistant, whose frequent unsolicited appearance in Word 97 had annoyed many users, was changed to be less intrusive [Web08].

1 Xerox PARC: a research and development company founded in 1970.
2 Office Assistant: an animated helper used in all Office programs.
3 Clipboard: a special file or memory area (buffer) where data is stored temporarily before being copied to another location; used for copy and paste.

Word 2002

Word 2002 was bundled with Office XP and was released in 2001. Although its appearance was different, it had many of the same features as Word 2003. One of the key advertising strategies for the software was the removal of the Office Assistant in favor of a new help system, although in Word 2002 the Assistant was simply disabled by default [Web08].

Word 2003

For the 2003 version, the Office programs, including Word, were

rebranded to emphasize the unity of the Office suite, so that Microsoft

Word officially became Microsoft Office Word. Users continue to use both

names [Web08].

Word 2007

The release includes numerous changes, including a new XML-

based file format, a redesigned interface, and an integrated equation editor

[Web08].

Word 2008

Word 2008 is the most recent version of Microsoft Word for the

Mac, released on January 15, 2008. It includes some new features from

Word 2007[Web08].



3.3 Microsoft Word Document and its Components [Web11]

Documents in Word have a hierarchical structure, as shown in Figure (3.2).

Figure (3.2) External Structure of a Word Document

Different types of properties apply to different units in the hierarchy:

Section. By default a document is a single section, but settings for margins, headers and footers, footnotes, and columns apply to whole sections, so a section break is needed to change any of these for only part of a document. A new section is made using Insert | Break and selecting one of the four types of "section breaks".

Paragraph. Most formatting in Word applies at the paragraph level: indents, line spacing, default font properties, bullets, etc. Many aspects of paragraph formatting can be applied all at once to a paragraph using paragraph styles.

Character. Some formatting attributes apply at the level of the individual character, such as the bold font in the first word of this paragraph. A set of character attributes can be applied together using character styles.


In addition to these parts of the main document, there are other special kinds of text which Word refers to as "stories". These include footnotes, comments, and headers and footers. These items are stored separately from the main text and require special commands to access and edit.

Customizations. Customizations such as definitions, macros, and toolbars may be stored either in the document or in the document's associated template.

Styles. Styles are collections of format specifications which can be applied all together to a paragraph or a group of characters. The advantage of using styles to apply formatting is that the formatting of all paragraphs of a certain type (e.g. examples, section headings, or footnotes) can easily be changed simply by redefining the style. A linguistics paper usually goes through a number of stages: as a term paper, as a draft circulated for comments, as a conference handout, as a journal submission, and as camera-ready copy for a volume. Each of these stages has its own format requirements. Using styles right from the beginning for all formatting can save a huge amount of time over the life of a paper.

3.4 Annotation and Collaboration Tools [Web11]

A linguist will often be working together with someone else on a document, either as a co-author or in a student-teacher relationship. Word has some easy-to-use tools to facilitate such collaborative work.

3.4.1 Track Changes

The "Track Changes" tool gives access to a simple method of keeping track of the changes a particular user makes to a document. Insertions are displayed in color and underlined; deletions and format changes are displayed in bubbles like comments. An example of Track Changes is shown in Figure (3.3) [Web11].

Track Changes is a way for Microsoft Word to keep track of the changes made to a document. Track Changes is also known as redline, or redlining. This is because some industries traditionally draw a vertical red line in the margin to show that some text has changed [Web04].

Figure (3.3) Track change example

3.4.2 Comments

The "Comment" feature allows comments to be added to the document. In Page Layout view, recent versions of Word display comments in "bubbles" on the right side of the text (moving the text over to make room in the margin for the comment). Comments from different reviewers appear in different colors. An example of comments is shown in Figure (3.4) [Web11].

Figure (3.4) Comments example


3.5 File Format [Web03]

A file format is a particular way to encode information for storage in a computer file.

Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g. word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.

Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for the storage of several different types of data.

3.6 Identifying the Type of a File [Web03]

Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the file system, which is an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages, as follows.

3.6.1 Filename Extension

One popular method, in use by several operating systems including DOS and Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm) [Web03].

3.6.2 Magic Number

An alternative method, often associated with UNIX and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere [Web03].
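The two identification methods can be contrasted with a short Python sketch. The function names and the small MAGIC_NUMBERS table are assumptions made for illustration; the GIF signatures come from the text above, and the 8-byte compound document identifier is the one listed in Table (3.2).

```python
import os

# Known leading byte sequences ("magic numbers") and the format they indicate.
MAGIC_NUMBERS = {
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    bytes.fromhex("D0CF11E0A1B11AE1"): "compound document",
}


def type_from_extension(filename):
    """Return the part of the name after the final period, e.g. '.html'."""
    return os.path.splitext(filename)[1].lower()


def type_from_magic(data):
    """Return the format whose magic number starts `data`, or None if unknown."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None


print(type_from_extension("index.HTML"))    # .html
print(type_from_magic(b"GIF89a\x01\x00"))   # GIF image
```

The extension is cheap to check but can be changed by renaming the file, whereas the magic number requires reading the first bytes of the content.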

3.7 File Structure

Each format uses a structure (a way to organize data for storing) in a file [FOL98].

There are several ways to structure data in a file. The most usual ones are described in Figure (3.5).

[Diagram: "File structure" branches into three types: Raw memory dumps (RMD), Chunk based format (CBF), and Directory based format (DBF).]

Figure (3.5) File Structure Types


3.7.1 Raw Memory Dumps/Unstructured Formats (RMD) [Web03]

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.

This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of file is very difficult. On the other hand, developing tools for reading and writing these types of files is very simple.

The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.

3.7.2 Chunk based Formats (CBF) [Web03]

In this kind of file structure, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a chunk. The signature is usually called a chunk id, chunk identifier, or tag identifier.

With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Even XML can be considered a kind of chunk based format, since each data element is surrounded by tags which are akin to chunk identifiers.
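The skipping behaviour can be illustrated with a small Python sketch. The chunk layout used here (a 4-byte tag followed by a 4-byte little-endian length and the payload) is a hypothetical one chosen for the example, not the layout of any particular format, and the function names are likewise assumptions.

```python
import struct


def make_chunk(tag, payload):
    """Build one chunk in the hypothetical layout: 4-byte tag, 4-byte length, payload."""
    return struct.pack("<4sI", tag, len(payload)) + payload


def iter_chunks(data):
    """Yield (tag, payload) pairs; the length field tells where the next chunk starts."""
    offset = 0
    while offset + 8 <= len(data):
        tag, length = struct.unpack_from("<4sI", data, offset)
        end = offset + 8 + length
        if end > len(data):
            raise ValueError("truncated chunk")
        yield tag, data[offset + 8:end]
        offset = end


def read_known(data, known_tags):
    """Keep the chunks a tool understands and silently skip all others."""
    return {tag: payload for tag, payload in iter_chunks(data) if tag in known_tags}


stream = (make_chunk(b"TEXT", b"hello")
          + make_chunk(b"XTRA", b"added by a newer version")
          + make_chunk(b"AUTH", b"me"))
print(read_known(stream, {b"TEXT", b"AUTH"}))
```

Because the length is stored with every chunk, an older tool can step over the XTRA chunk without knowing what it contains, which is what makes such formats extensible.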

3.7.3 Directory based Formats (DBF) [Web03]

This is another extensible format that closely resembles a file system (OLE documents are actual file systems), where the file is composed of "directory entries" that contain the location of the data within the file itself as well as its signature (and in certain cases its type). Good examples of this type of file structure are disk images and OLE documents [Web03].

3.8 Structure Storage

The lowest level of organization that is normally imposed on a file is a stream of bytes.

By storing data in a file merely as a stream of bytes, the ability to distinguish among the fundamental information units of the data is lost. These fundamental pieces of information are called fields. Fields are grouped together to form records. Records are grouped together to form blocks [FOL98], as shown in Figure (3.6).

[Diagram: a block contains records, a record contains fields, and a field is a stream of bits (0, 1).]

Figure (3.6) Logic view of file

In persistent storage, files are normally stored in the form of bytes. A file is treated as a raw sequence of bytes. The entire file is stored in blocks on the disk. These blocks are scattered on the disk. When reading the file, the file system manages its pointers and returns a sequence of bytes [CHA00].


Structured storage follows a different approach to storing a file and its data on persistent storage. It defines how to treat a file as a structured collection of objects. These objects are storages and streams, as shown in Figure (3.7).

[Diagram: a Root node contains STORAGE and STREAM objects; a STORAGE can in turn contain further STORAGE and STREAM objects.]

Figure (3.7) Storage and Stream Structure

A storage object is a kind of directory, and it can contain other storage objects and stream objects. A stream object can be thought of as a file: like a file, a stream contains data stored as a consecutive sequence of bytes. A compound file is a combination of these two kinds of object [CHA00].

A compound file is a file which contains different types of data saved in a structured format, for example a compound file which has some text, some images, and other data. Suppose one more object is to be added to such a file. In the traditional approach, when saving a file, the file system rewrites the entire data. The structured storage approach eliminates this rewriting process and increases the read/write performance: the new data is written to the next available location in permanent storage, and the storage object updates the table of pointers it maintains to track the locations of its storage objects and stream objects [CHA00].


Here are some other benefits:

Structured storage approach provides control over separate

objects. It can read/write separate objects instead of the entire

compound file [CHA00].

More than one user can concurrently read/write the same file

[CHA00].

3.9 Microsoft Compound Document File Format (MCDFF)

A Word file that contains an Excel sheet and chart, an image, a table, and some macros is an example of a compound file.

Files which use MCDFF (Microsoft Compound Document File Format) include the output files of MS Office 97-2003 applications such as MS Word, PowerPoint, and Excel [CHA00].

The Microsoft Compound Document File Format (MCDFF) 2003 is a document file format based on OLE (Object Linking and Embedding), which is used for saving various resources as an integrated document in Microsoft Office [MIC07].

A storage component may exist as a standalone component. Each storage component may have one or more sub-storage components and stream components. The root component may also have stream components directly within it [JIT06].

3.10 Structure of Word Document Files

Let us take a look at the structure of a Word document with an embedded Excel object, shown in Figure (3.8).

[Diagram: the root storage "MS Word" contains the streams JPEG Image, Word Document, Data, Table, Summary Information, Document Summary Information, and CompObj, and the storage Object Pool. The Object Pool contains the storage Excel Sheet, which holds the streams Workbook, SummaryInformation, and DocumentSummaryInformation.]

Figure (3.8) Sample of Word document storage format

The binary format of Microsoft Word 97 and later versions is based on a structure referred to as a .doc file or compound file.

A Word .doc file consists of [MIC07]:

I. Word Document (main stream)
II. Summary information stream
III. Table stream
IV. Data stream
V. Custom XML storage (added in Word 2007)

as well as zero or more object streams which contain private data for OLE 2.0 objects embedded within the Word document [MIC07].

The "MS Word" component is the root component, containing several streams and one storage item. Different parts of the document, such as the actual contents, any table inserted, the CompObj associated with the DLL files for the objects, the Summary Information for the content, any image inserted, and the Document Summary Information, all take the form of streams under the root component. The ObjectPool is the collective storage of all the sub-storage components. Figure (3.8) displays a sample of the sub-storage Excel component. The Excel Sheet itself is a storage component within the ObjectPool and has its own streams of information: the Workbook, SummaryInformation, and DocumentSummaryInformation [JIT06].

Custom XML data store (added in Word 2007): The custom XML data store specifies custom-defined XML files contained in the binary Microsoft Word 97 format or the Office Open XML formats [MIC07].

Data stream: The stream within a Word .doc file that contains various data anchored to characters in the main stream, for example binary data describing in-line pictures and/or form fields [MIC07].

Main stream: The stream within a Word .doc file that contains the bulk of Word's binary data [MIC07].

Object storage: A storage that contains binary data for an embedded OLE 2.0 object. Multiple instances are referred to as storages [MIC07].

Stream: The physical encoding of a Word document's text and sub data structures in a random access stream within a .doc file [MIC07].

Summary Information stream: The stream within a Word .doc file that contains the document summary information [MIC07].

Table stream: The stream within a Word .doc file that contains the various plcfs and tables that describe a document's structures [MIC07].

3.11 Format of the Main Stream

The main stream of a Word binary file (complex format) consists of the Word file header (FIB), the text, and the formatting information.

FIB (File Information Block)

The header of a Word file begins at offset 0 in the file. It gives the beginning offsets and lengths of the document's text stream and subsidiary data structures within the file. It also stores other file status information.

The FIB contains a "magic number" and pointers to the various other parts of the file, as well as information about the length of the file. The FIB is defined in the structure definition section of the specification [MIC07].

Text

The text part contains all the text of the document (including footnotes, header and footer lines, etc.). The document's text is also located in the main stream [DIA08].

Word has used this same file format since its first version. This means that Word 1.0 can read Word 5.0 files and vice versa. This compatibility was accomplished by defining all structures to be larger than they needed to be and setting all reserved fields to zero for use in future versions. Reserved pointers in the document header have been used to add entirely new document sections (such as document retrieval information and bookmark tables) [Web09].

Because of the important issue of compatibility with future versions, all fields in all structures which are not currently being used MUST be filled with zeros. When the fields are finally defined for a new feature, either zero will be made the default value of those fields or zero will represent an uninitialized state which will be ignored [Web09].

3.12 MCDFF Metadata

MCDFF uses metadata to manage information about streams and storages. Table (3.1) describes the type of information contained in each kind of metadata in MCDFF [HYU08].

Table (3.1) MCDFF Metadata

Name of metadata   Information contained
Header             Signature, pointer table of BAT
BAT                Block Allocation Table
SBAT               Small Block Allocation Table
Directory          Stream and storage information

The exact format structure of these metadata was provided by the OpenOffice.org Spreadsheet Project's documentation of the Microsoft Compound Document File Format [DAN07] and by the Apache POIFS project of Apache.org [MAR07]. POIFS file systems are called "file systems" because they contain multiple embedded files in a manner similar to traditional file systems. A word processor file with the extension ".doc" is thus actually a POIFS file system with a document file archived inside that file system [MAR07].

Most operating systems, including Microsoft Windows, manage hard disk drives by dividing their storage space into units known as partitions. Before data can be stored on a partition, the partition must be formatted. Formatting a partition organizes the associated space into what is called a file system, which provides space for storing the names and attributes of files as well as the data they contain. Microsoft Windows supports several types of file systems, such as FAT and FAT32. Formatting a disk divides the disk into tracks, and each track is divided into sectors, sometimes called disk blocks, as shown in Figure (3.9). Partitions comprise the logical structure of a disk drive, the way humans and most computer programs understand the structure. However, disk drives have an underlying physical structure that more closely resembles the actual structure of the hardware.

Figure (3.9) The structure of a hard disk [MCC99]

MCDFF uses two types of data unit: the small block (sector) and the big block (block) [HYU08].

If the size of a stream (an embedded file) is less than 4096 bytes, the stream is stored in small blocks and the SBAT is used to walk the small blocks (sectors) making up the stream. If the size is 4096 bytes or larger, the stream is stored in big blocks (blocks) and the main BAT is used to walk the big blocks making up the stream [MAR07].

The (zero-based) index of a sector is called the sector identifier (SecID). SecIDs are signed 32-bit integer values. If a SecID is not negative, it must refer to an existing sector. If a SecID is negative, it has a special meaning [DAN07]:

SecID   Name                 Meaning
-1      Free SecID           Free sector; may exist in the file, but is not part of any stream
-2      End Of Chain SecID   Trailing SecID in a SecID chain
-3      SAT SecID            Sector is used by the sector allocation table
-4      MSAT SecID           Sector is used by the master sector allocation table
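The size rule and the special SecIDs can be combined into a short Python sketch of how a stream's blocks might be walked. The constant and function names are assumptions made for this illustration. The sketch also assumes that the allocation table is available as a list in which the entry at index i holds the SecID of the sector that follows sector i in its chain, with -2 marking the end.

```python
# Special SecID values from the text.
FREE_SECID = -1
END_OF_CHAIN_SECID = -2
SAT_SECID = -3
MSAT_SECID = -4

MIN_STANDARD_STREAM_SIZE = 4096  # streams smaller than this use small blocks


def uses_small_blocks(stream_size):
    """True if the stream is stored in small blocks and walked through the SBAT."""
    return stream_size < MIN_STANDARD_STREAM_SIZE


def walk_chain(allocation_table, first_secid):
    """Follow the chain starting at `first_secid` until End Of Chain is reached."""
    chain = []
    secid = first_secid
    while secid != END_OF_CHAIN_SECID:
        if not 0 <= secid < len(allocation_table):
            raise ValueError(f"SecID {secid} does not refer to an existing sector")
        if len(chain) >= len(allocation_table):
            raise ValueError("chain is longer than the table, probably a cycle")
        chain.append(secid)
        secid = allocation_table[secid]
    return chain


# Hypothetical table: sector 0 -> 1 -> 3 -> end; sector 2 is free; sector 4 holds the SAT.
table = [1, 3, FREE_SECID, END_OF_CHAIN_SECID, SAT_SECID]
print(walk_chain(table, 0))     # [0, 1, 3]
print(uses_small_blocks(1500))  # True
```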

3.12.1 Compound Document Header

The compound document header (simply "header" in the following) contains all the data needed to start reading a compound document file. The header is always located at the beginning of the file and is 512 bytes long; this implies that the first sector (with SecID 0) always starts at file offset 512. The first 64 bits of the header form the identifier, or magic number, of an Office file.

The header also contains an array of block numbers. These block numbers refer to blocks in the file. When these blocks are read together they form the Block Allocation Table. The header also contains a pointer to the first element in the property table, also known as the root element, and a pointer to the Small Block Allocation Table (SBAT) [MAR07].

The Block Allocation Table (BAT), along with the property table, specifies which blocks in the file system belong to which files [MAR07].

The contents of the compound document header structure are described in Table (3.2).


Table (3.2) Compound document header structure [DAN07]

Offset  Size  Contents
0       8     Compound document file identifier: D0 CF 11 E0 A1 B1 1A E1
8       16    Unique identifier (UID) of this file
24      2     Revision number of the file format (most used is 003E)
26      2     Version number of the file format (most used is 0003)
28      2     Byte order identifier: FEH FFH = Little-Endian, FFH FEH = Big-Endian
30      2     Size of a sector in the compound document file in power-of-two (ssz); real sector size is sec_size = 2^ssz bytes (minimum value is 7, which means 128 bytes; most used value is 9, which means 512 bytes)
32      2     Size of a short-sector in the short-stream container stream in power-of-two (sssz); real short-sector size is short_sec_size = 2^sssz bytes (maximum value is sector size ssz, see above; most used value is 6, which means 64 bytes)
34      10    Not used
44      4     Total number of sectors used for the sector allocation table
48      4     SecID of first sector of the directory stream
52      4     Not used
56      4     Minimum size of a standard stream (in bytes; minimum allowed and most used size is 4096 bytes); streams with an actual size smaller than (and not equal to) this value are stored as short-streams
60      4     SecID of first sector of the short-sector allocation table, or -2 (End Of Chain SecID) if not extant
64      4     Total number of sectors used for the short-sector allocation table
68      4     SecID of first sector of the master sector allocation table, or -2 (End Of Chain SecID) if no additional sectors used
72      4     Total number of sectors used for the master sector allocation table
76      436   First part of the master sector allocation table containing 109 SecIDs
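As an illustration of Table (3.2), the following Python sketch reads the listed fields from the first 512 bytes of a file with the standard struct module. The function name parse_header and the dictionary keys are assumptions made for this sketch, and only the Little-Endian byte order is handled; it is not the reader used in this work.

```python
import struct

SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # offset 0, 8 bytes
LITTLE_ENDIAN_MARK = b"\xFE\xFF"                # offset 28, 2 bytes


def parse_header(header):
    """Extract the fields of Table (3.2) from a 512-byte compound document header."""
    if len(header) < 512:
        raise ValueError("header must be at least 512 bytes")
    if header[:8] != SIGNATURE:
        raise ValueError("not a compound document file")
    if header[28:30] != LITTLE_ENDIAN_MARK:
        raise ValueError("only Little-Endian files are handled in this sketch")

    revision, version = struct.unpack_from("<HH", header, 24)
    ssz, sssz = struct.unpack_from("<HH", header, 30)
    sat_sectors, dir_first_secid = struct.unpack_from("<Ii", header, 44)
    (min_stream_size, ssat_first_secid, ssat_sectors,
     msat_first_secid, msat_sectors) = struct.unpack_from("<IiIiI", header, 56)
    msat_head = struct.unpack_from("<109i", header, 76)  # 109 SecIDs, 436 bytes

    return {
        "revision": revision,
        "version": version,
        "sector_size": 1 << ssz,          # sec_size = 2^ssz
        "short_sector_size": 1 << sssz,   # short_sec_size = 2^sssz
        "sat_sectors": sat_sectors,
        "dir_first_secid": dir_first_secid,
        "min_stream_size": min_stream_size,
        "ssat_first_secid": ssat_first_secid,
        "ssat_sectors": ssat_sectors,
        "msat_first_secid": msat_first_secid,
        "msat_sectors": msat_sectors,
        "msat_head": list(msat_head),
    }
```

A typical use, under the same assumptions, would be to open a .doc file in binary mode, read its first 512 bytes, and pass them to parse_header.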

The header format structure in Table (3.3) gives the block information used when the file is stored in blocks.

Note: The shaded cells in Table (3.3) are used in this work.


Table (3.3) Header (block 1) -- 512 (0x200) bytes [MAR07]

Offset   Length      Field                    Default value or const   Description
0x0000   Long        FILETYPE                 0xE11AB1A1E011CFD0       Magic number identifying this as a POIFS file system
0x0008   Integer     UK1                      0                        Unknown constant
0x000C   Integer     UK2                      0                        Unknown constant
0x0014   Integer                              0
0x0018   Short       UK4                      0x003B                   Unknown constant (revision?)
0x001A   Short       UK5                      0x0003                   Unknown constant (version?)
0x001C   Short                                -2
0x001E   Short       LOG_2_BIG_BLOCK_SIZE     9 (2 ^ 9 = 512 bytes)    Log, base 2, of the big block size
0x0020   Integer     LOG_2_SMALL_BLOCK_SIZE   6 (2 ^ 6 = 64 bytes)     Log, base 2, of the small block size
0x0024   Integer                              0
0x0028   Integer                              0
0x002C   Integer     BAT_COUNT                required                 Number of elements in the BAT array
0x0030   Integer     PROPERTIES_START         required                 Block index of the first block of the property table
0x0034   Integer                              0
0x0038   Integer                              0x00001000
0x003C   Integer     SBAT_START               -2                       Block index of first big block containing the small block allocation table (SBAT)
0x0040   Integer     SBAT_Block_Count         1                        Number of big blocks holding the SBAT
0x0044   Integer     XBAT_START               -2                       Block index of the first block in the Extended Block Allocation Table (XBAT)
0x0048   Integer     XBAT_COUNT               0                        Number of elements in the Extended Block Allocation Table (to be added to the BAT)
0x004C, 0x0050, 0x0054 ... 0x01FC
         Integer[ ]  BAT_ARRAY                -1 for unused elements, at least first element must be filled
                                                                       Array of block indices constituting the Block Allocation Table (BAT)
N/A      N/A         N/A                      -1                       Header block data not otherwise described in this table

3.12.2 Byte Order [DAN07]

All data items containing more than one byte may be stored

using the Little-Endian or Big-Endian method, but in real world

applications only the Little-Endian method is used. The Little-

Endian method stores the least significant byte first and the most

significant byte last. This applies to all data types like 16-bit

integers, 32-bit integers, and Unicode characters.

Example: The 32-bit integer value 13579BDFH is converted

into the Little-Endian byte sequence DFH 9BH 57H 13H, or

to the Big-Endian byte sequence 13H 57H 9BH DFH.
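The example can be checked with Python's built-in integer conversion methods; the variable names are only illustrative.

```python
value = 0x13579BDF

# Least significant byte first (Little-Endian) and most significant byte first (Big-Endian).
little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")

print(little.hex(" ").upper())  # DF 9B 57 13
print(big.hex(" ").upper())     # 13 57 9B DF

# Reading the bytes back with the matching byte order restores the value.
restored = int.from_bytes(little, byteorder="little")
print(hex(restored))            # 0x13579bdf
```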



3.12.3 Sector File Offsets [DAN07]

With the values from the header it is possible to calculate a file offset from a SecID:

$$\mathrm{sec\_pos}(\mathrm{SecID}) = 512 + \mathrm{SecID} \cdot \mathrm{sec\_size} = 512 + \mathrm{SecID} \cdot 2^{\mathrm{ssz}} \qquad (3.1)$$

Example with ssz = 10 and SecID = 5:

$$\mathrm{sec\_pos}(5) = 512 + 5 \cdot 2^{10} = 512 + 5 \cdot 1024 = 5632$$

Note: The previous equation is used to calculate the block position too.
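Equation (3.1) translates directly into code. The following Python sketch uses the hypothetical names sec_pos and read_sector; the constant 512 is the header size that precedes the first sector.

```python
HEADER_SIZE = 512  # the header occupies the first 512 bytes of the file


def sec_pos(secid, ssz):
    """File offset of the sector with the given SecID, following equation (3.1)."""
    if secid < 0:
        raise ValueError("special (negative) SecIDs have no file position")
    return HEADER_SIZE + secid * (1 << ssz)


def read_sector(data, secid, ssz):
    """Return the bytes of one sector from the file contents `data`."""
    start = sec_pos(secid, ssz)
    return data[start:start + (1 << ssz)]


print(sec_pos(5, 10))  # 5632, as in the worked example
print(sec_pos(0, 9))   # 512, the first sector directly after the header
```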

3.12.4 Property Table (Directory)

The Property Table is essentially the directory system of the compound document. Properties (directory entries) are 128-byte records contained within the 512-byte blocks. Each directory entry refers to a storage or a stream in the compound document. The zero-based index of a directory entry is called the directory entry identifier (DirID). There is a special directory entry at the beginning of the directory (with DirID 0). It represents the root storage and is called the root storage entry [DAN07]. The contents of the directory entry structure are described in the following table.



Table (3.4) Directory entry structure [DAN07]

Offset | Size | Contents
0 | 64 | Character array of the name of the entry, always 16-bit Unicode characters, with trailing zero character (results in a maximum name length of 31 characters)
64 | 2 | Size of the used area of the character buffer of the name (not character count), including the trailing zero character (e.g. 12 for a name with 5 characters: (5+1)·2 = 12)
66 | 1 | Type of the entry: 00H = Empty, 01H = User storage, 02H = User stream, 03H = LockBytes (unknown), 04H = Property (unknown), 05H = Root storage
67 | 1 | Node colour of the entry: 00H = Red, 01H = Black
68 | 4 | DirID of the left child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no left child
72 | 4 | DirID of the right child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no right child
76 | 4 | DirID of the root node entry of the red-black tree of all storage members (if this entry is a storage), -1 otherwise
80 | 16 | Unique identifier, if this is a storage (not of interest in the following, may be all 0)
96 | 4 | User flags (not of interest in the following, may be all 0)
100 | 8 | Time stamp of creation of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
108 | 8 | Time stamp of last modification of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
116 | 4 | SecID of first sector or short-sector, if this entry refers to a stream; SecID of first sector of the short-stream container stream, if this is the root storage entry; 0 otherwise
120 | 4 | Total stream size in bytes, if this entry refers to a stream; total size of the short-stream container stream, if this is the root storage entry; 0 otherwise
124 | 4 | Not used

The Property format structure in Table (3.5) gives the Block information that is used when the file is stored in Blocks.

Note: the shadow cells in Table (3.5) are used in this work.



Table (3.5) Property: 128 (0x80) byte block [MAR07]

Field | Description | Offset | Length | Default value or const
NAME | A Unicode null-terminated uncompressed 16-bit string (lose the high bytes) containing the name of the property | 0x00, 0x02, 0x04, ... 0x3E | Short[] | 0x0000 for unused elements, field required, 32 (0x20) element max
NAME_SIZE | Number of characters in the NAME field | 0x40 | Short | Required
PROPERTY_TYPE | Property type (directory, file, or root) | 0x42 | Byte | 1 (directory), 2 (file), or 5 (root entry)
NODE_COLOR | Node color | 0x43 | Byte | 0 (red) or 1 (black)
PREVIOUS_PROP | Previous property index | 0x44 | Integer | -1
NEXT_PROP | Next property index | 0x48 | Integer | -1
CHILD_PROP | First child property index | 0x4C | Integer | -1
SECONDS_1 | Seconds component of the created timestamp? | 0x64 | Integer | 0
DAYS_1 | Days component of the created timestamp? | 0x68 | Integer | 0
SECONDS_2 | Seconds component of the modified timestamp? | 0x6C | Integer | 0
DAYS_2 | Days component of the modified timestamp? | 0x70 | Integer | 0
START_BLOCK | Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT | 0x74 | Integer | Required
SIZE | Actual size of the file this property points to (used to truncate the blocks to the real size) | 0x78 | Integer | 0
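The directory entry layout of Tables (3.4) and (3.5) can be decoded with a few lines of Python. The sketch below is illustrative only: the function name, the dictionary keys, and the sample entry built at the end are assumptions made here, and the unique identifier and time stamp fields are skipped.

```python
import struct


def parse_directory_entry(entry: bytes) -> dict:
    """Decode the main fields of one 128-byte directory entry (property)."""
    if len(entry) != 128:
        raise ValueError("a directory entry is exactly 128 bytes long")
    # Offsets 64..79: name size, type, node colour, left, right and child DirIDs.
    name_size, entry_type, colour, left, right, child = struct.unpack_from(
        "<HBBiii", entry, 64)
    # Offsets 116..123: starting block (SecID) and stream size.
    start_block, size = struct.unpack_from("<iI", entry, 116)
    # name_size counts bytes and includes the trailing zero character.
    name = bytes(entry[:max(name_size - 2, 0)]).decode("utf-16-le")
    return {"name": name, "type": entry_type, "colour": colour,
            "left": left, "right": right, "child": child,
            "start_block": start_block, "size": size}


# Hypothetical sample entry, built only to exercise the parser.
raw_name = "Root Entry".encode("utf-16-le") + b"\x00\x00"
sample = bytearray(128)
sample[:len(raw_name)] = raw_name
struct.pack_into("<HBBiii", sample, 64, len(raw_name), 5, 1, -1, -1, 3)
struct.pack_into("<iI", sample, 116, 66, 128)

info = parse_directory_entry(bytes(sample))
print(info["name"], info["type"], info["start_block"], info["size"])
# Root Entry 5 66 128
```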



3.12.5 Block Allocation Table (BAT)

The BAT (Block Allocation Table) is the main allocation table within MCDFF; it is needed to read any other stream in the file [HYU08].

The BAT blocks are pointed at by the BAT array contained in the header. Together these blocks form a large table of integers, and these integers are block numbers. The Block Allocation Table holds chains of integers [MAR07].

The elements in these chains refer to blocks in the file. The starting block of a file is NOT specified in the BAT; it is specified by the property of the given file. Each element in the BAT serves both as a block number (within the file, not counting the header) and as the number of the next BAT element in the chain. This can be thought of as a linked list of blocks: the BAT array contains the links from one block to the next, including the end-of-chain marker [MAR07]. The BAT format structure is shown in Table (3.6).

Here's an example: Let's assume that the BAT begins as follows:

BAT [0] = 2

BAT [1] = 5

BAT [2] = 3

BAT [3] = 4

BAT [4] = 6

BAT [5] = -1

BAT [6] = 7

BAT [7] = -2



Now, if a file's Property Table entry says that it begins at index 0, walking the BAT array shows that the file consists of blocks 0 (because the start block is 0), 2 (because BAT[0] is 2), 3 (BAT[2] is 3), 4 (BAT[3] is 4), 6 (BAT[4] is 6), and 7 (BAT[6] is 7). It ends at block 7 because BAT[7] is -2, which is the end-of-chain marker. Similarly, a file beginning at index 1 consists of blocks 1 and 5; BAT[5] is -1, the value that marks an unused block.
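The walk described above can be reproduced with a short Python sketch. It is an illustration only; the function name and the decision to stop at any negative entry (end-of-chain, unused, or special) are choices made here.

```python
def walk_chain(bat, start):
    """Follow a block chain through the BAT, starting from a property's start block.

    Returns the list of visited blocks and the negative marker that ended the walk.
    """
    chain, seen = [], set()
    block = start
    while block >= 0:
        if block >= len(bat) or block in seen:
            raise ValueError("corrupt chain at block %d" % block)
        seen.add(block)
        chain.append(block)
        block = bat[block]  # the BAT entry of a block names the next block
    return chain, block


bat = [2, 5, 3, 4, 6, -1, 7, -2]   # the example BAT given above
print(walk_chain(bat, 0))  # ([0, 2, 3, 4, 6, 7], -2)
print(walk_chain(bat, 1))  # ([1, 5], -1)
```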

The other special number in a BAT array is -3, which indicates a "special" block, such as a block used to make up the Small Block Array, the Property Table, the main BAT, or the SBAT [MAR07].

Table (3.6) Block Allocation Table Block [MAR07]

Field | Description | Offset | Length | Default value or const
BAT_ELEMENT | Any given element in the BAT block | 0x0000, 0x0004, 0x0008, ... 0x01FC | Integer | -1 = unused; -2 = end of chain; -3 = special (e.g., BAT block). All other values point to the next element in the chain and the next index of a block composing the file.
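As an illustration of Table (3.6), the Python sketch below decodes one 512-byte BAT block into its 128 integers and counts the kinds of entries it holds. The function names, the labels, and the sample block are assumptions made here and are not taken from the cited references.

```python
import struct
from collections import Counter

LABELS = {-1: "unused", -2: "end of chain", -3: "special"}


def read_bat_block(block: bytes) -> list:
    """Decode a BAT block into a list of little-endian 32-bit signed integers."""
    if len(block) % 4 != 0:
        raise ValueError("a BAT block must hold whole 4-byte elements")
    return list(struct.unpack("<%di" % (len(block) // 4), block))


def summarize(bat) -> Counter:
    """Count unused, end-of-chain, special and ordinary 'next block' entries."""
    return Counter(LABELS.get(value, "next block") for value in bat)


# Hypothetical block: eight meaningful entries, the remaining 120 left unused.
block = struct.pack("<8i", 2, 5, 3, 4, 6, -1, 7, -2) + struct.pack("<120i", *([-1] * 120))
bat = read_bat_block(block)
print(len(bat))        # 128
print(summarize(bat))
```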

In the physical structure of an MCDFF file, each Block following the Header is numbered with an index. Figure (3.10) shows the process of accessing “Sample A Stream”. The index of the first Block of “Sample A Stream” is stored in its Directory entry, and the BAT is then consulted to find the indices of the other Blocks that the stream uses. In this example, the Directory entry gives 1 as the first index, and the BAT shows that “Sample A Stream” consists of three Blocks: the 1st, 4th and 5th [HYU08].



Figure (3.10) MS Compound files structure [HYU08]

3.12.6 Sector Allocation Table (SAT)

The Sector Allocation Table (SAT) is an array of SecIDs. It

contains the SecID chain of all user streams. The size of the SAT

(number of SecIDs) is equal to the number of existing sectors in the

compound document file [DAN07].

3.13 Office Automation

Office Automation / OLE Automation (later renamed by Microsoft to just Automation) is an inter-process communication mechanism based on the Component Object Model (COM). It was intended for use by scripting languages, originally Visual Basic, but is now used by languages that run on Windows. It provides an infrastructure whereby applications called automation controllers can access and manipulate (i.e. set properties of or call methods on) shared automation objects that are exported by other applications. The automation controller is the "client" and the application exporting the automation objects is the "server" [Web10].

3.14 PIA for Microsoft Office 2003

The following table lists the PIAs available for use with Office 2003. Table (3.7) lists the Microsoft Office 2003 applications and component type libraries that have the same version number and that are signed with the same key [KHO05].

Table (3.7) Office 2003 applications and component type libraries with the same version number, signed with the same key [KHO05]

Office 2003 application or component | PIA name | PIA namespace
Microsoft Office 11.0 Object Library | Office.dll | Microsoft.Office.Core
Microsoft Word 11.0 Object Library | Microsoft.Office.Interop.Word.dll | Microsoft.Office.Interop.Word

3.15 Word Object Model

Word provides hundreds of objects. These objects are organized in a hierarchy that closely follows the user interface.

The Word Visual Basic Help contains a diagram of Word's object model. The figure is "live": clicking on an object takes the reader to the Help topic for that object. Figure (3.11) shows the portion of the object model diagram that describes the Document object [GRA01].

The key object in Word is Document, which represents a single, open document. The Document object has many properties and methods. Many of its properties are references to collections such as Paragraphs, Tables and Sections. Each of these collections contains references to objects of the indicated type, and each object contains information about the appropriate piece of the document. For example, the Paragraph object has properties like KeepWithNext and Style, as well as methods like Indent and Outdent [GRA01].

Figure (3.11) Word Object Model: the Word Visual Basic Help file offers a global view of Word's structure [GRA01].



3.16 Platform Invoke (PInvoke)

There is often a need to call a function located in an unmanaged DLL from within the .NET Framework. Platform invoke, or PInvoke, is the technique used to make this happen [Web01].

Figure (3.12) A platform invoke call to an unmanaged DLL function [Web01].

When platform invoke calls an unmanaged function, it performs the following sequence of actions [Web01]:

I. Locates the DLL containing the function.

II. Loads the DLL into memory.

III. Locates the address of the function in memory and pushes its arguments onto the stack, marshaling data as required.

Note: Locating and loading the DLL, and locating the address of the function in memory, occur only on the first call to the function.

IV. Transfers control to the unmanaged function.

3.17 Application Programming Interfaces (API) [Web12]

An API is a set of functions that can be used to work with a

component, application, or operating system. Typically, an API consists of

one or more DLLs that provide some specific functionality.

DLLs are files that contain functions that can be called from any

application running in Microsoft Windows.

3.18 Office Application Programming Interfaces (APIs) [Web05]

Office binary file formats are designed to be accessed through the

Office Application Programming Interfaces (APIs), instead of by direct

manipulation of the file format. Because of the complexity of the formats,

direct manipulation can cause corruption and is strongly discouraged.

The Office 97-2003 binary file formats use the Windows Structured

Storage APIs. The Office-specific information is stored as streams in this

more generalized format. Common elements, such as document properties,

can be accessed through the Structured Storage APIs.

Chapter Four

Proposed Hiding System in Microsoft Compound Document File Format (MCDFF)

4.1 Introduction

The proposed system is one of the text steganography methods. It is used for embedding a steganographic string into a document, namely a Microsoft Word 2003 document file.

The proposed system embeds the steganographic string in an Unused Block of the Microsoft Compound Document Binary File Format (MCDFF). It consists of two processes for embedding, the Cover Generation process and the Embedding process, as shown in the block diagram in Figure (4.1).


[Block diagram. Sender side: the secret message (A..Z) is encoded with Huffman coding into a binary message (0/1); the Microsoft Word document file (.doc) passes through the Cover Generation process so that it appears to be a collaborative writing effort; the Embedding process hides the encoded secret message in MCDFF, and the resulting stegodocument is sent. Receiver side: the hidden binary data is extracted from the stegodocument and decoded to recover the secret message (A..Z).]

Figure (4.1) Block Diagram for Proposed System


4.2 Cover Generation Process

The Cover Generation process disguises the data embedding as the product of a collaborative document authoring effort. That is, the stegodocument is made to appear to be the work of multiple authors. To facilitate communication between the authors during collaborative document authoring, the word processor records the exact modifications made by an author and embeds them as change tracking information in the document. From such change tracking information, a later author can discern the exact changes made by a prior author and can recover a prior version of the document if necessary (see section 3.4, Annotation and collaboration tools).

Figure (4.2) Screenshot of Microsoft Word in case of collaborative document authoring

Figure (4.2) shows an example of the collaborative document

authoring process in Microsoft Word, where an author is modifying a

Document and the word processor has tracked the author’s

modifications.



Each collaborating author can accept or reject individual or all

modifications made by another author. It is a common practice for a

collaborating author to review and then accept or reject each modification

in a document first before performing his or her own corrections.

When Microsoft first introduced "Track Changes", "Authors" put "changes" into their documents. In more recent versions, "Reviewers" make "revisions" to their documents, and "revisions" are one kind of "markup".

The basic idea of the proposed system is to degenerate the contents of a cover document D to arrive at another document D', and to embed a secret message M in D' during the Embedding process, as shown in Figure (4.3). The degeneration introduces errors into the degenerated document D' such that it appears to be a preliminary work by a virtual author A', which is to be revised later by another author.

Figure (4.3) Author A sends a stegodocument S with an embedded message M

to a recipient B after embedding M into a cover document D' to form S that appears to be the collaborative product of multiple authors A and A'.



A binary secret message M is embedded inside a cover document D' to

obtain a stegodocument S.

Microsoft Word documents, which provide the change tracking facilities needed to materialize the proposed method, have been chosen as cover media. Communication via Word documents is commonplace for personal, business, or academic purposes these days, and such documents are widely used in the Middle East. The transmission of such documents will therefore not come under close scrutiny.

Most of the works cited in the introduction use the technique of

modifying a cover medium to embed information. This type of data

hiding generally assumes that the cover medium used is unknown to an

adversary, or otherwise, the discrepancies between the cover medium and

the corresponding stegomedium will arouse suspicion. On the other hand,

the proposed method provides legitimate cases in using a known cover

document. For example, an already published document that was collaboratively authored can be used as a cover document. The stegodocument S then appears to be the version of the paper before the change tracking information was removed and the paper was submitted for publication. The transmission of S by one of the collaborating authors to another author, a colleague, or a supervised student of the author is reasonable.

A colleague or a student receiving the document containing the change tracking information can learn of the mistakes made by a colleague and the appropriate corrections to them.

4.3 Embedding Process

This method of hiding data in MCDFF hides information in unused space. Unused space occurs in the form of Unused Blocks, which are used as follows:



(4.1) Algorithm for Hiding Data

Input: Document of Microsoft Compound Document File Format (MCDFF).

Output: Stegodocument.

Step1: Open MCDFF file.

Step2: Read Secret Message from user.

Step3: Encode Secret Message with Huffman Coding.

Step4: Search for Unused Block in MCDFF file.

Step5: Insert Secret Message into Unused Block of MCDFF file.

Step6: Save the document file.

Step7: End.



Hiding Algorithm can be described as follows:

Step1

Open document file with Track change information (Microsoft Word

Document 2003) see Appendix B.

Step2

Enter the secret message intended for hiding.

Step3

Encode that message with Huffman coding.

Step4

In this step the Search Unused Block Algorithm is called for finding the

Unused Block Address in the document Binary file format.

Step5

After finding the Unused Block Address, it will hide the encoded secret

message in it.

Step6

Save the document file with hidden data.

Step7

End.
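Steps 5 and 6 can be sketched in Python on an in-memory copy of the file. The storage layout used here is an assumption introduced only for this illustration: a 2-byte little-endian bit count followed by the packed bits. The function names and the 512-byte header and block sizes are also assumptions; the actual implementation of this work was written in C#.

```python
import struct

HEADER_SIZE = 512
BLOCK_SIZE = 512


def bits_to_bytes(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padded on the right."""
    if not bits:
        return b""
    padded = bits + "0" * (-len(bits) % 8)
    return int(padded, 2).to_bytes(len(padded) // 8, "big")


def embed(document: bytearray, block_index: int, bits: str) -> None:
    """Write the encoded message into the given unused block (hypothetical layout)."""
    payload = struct.pack("<H", len(bits)) + bits_to_bytes(bits)
    if len(payload) > BLOCK_SIZE:
        raise ValueError("message does not fit into one block")
    offset = HEADER_SIZE + block_index * BLOCK_SIZE   # equation (3.1)
    if offset + BLOCK_SIZE > len(document):
        raise ValueError("block lies outside the file")
    document[offset:offset + len(payload)] = payload


# Toy document: header plus three empty blocks; block 2 plays the unused block.
doc = bytearray(HEADER_SIZE + 3 * BLOCK_SIZE)
embed(doc, 2, "0010100000000110")
print(doc[1536:1540].hex(" "))  # 10 00 28 06
```

Because only the content of the unused block changes, the header, the BAT and the document streams stay untouched, which is the property the hiding algorithm relies on.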


[Flowchart: Start → Open MCDFF file → Read Secret Message → Encode Secret Message with Huffman coding → Search for Unused Block in MCDFF file → Add Secret Message into Unused Block of MCDFF → Save MCDFF file → End]

Figure (4.4) Hiding Algorithm Flowchart


(4.2) Algorithm for Search Unused Block

Input: Document of Microsoft Compound Document Binary File Format (MCDFF).

Output: Unused Block Location.

Step1: Loading Compound Document Header of MCDFF file.

Step2: Extracting information and offset from Header like (Microsoft signature, Block size, Block index of the first block of the property table (first Directory), byte ordering, Block Allocation Table (BAT) ID, minimum size of a stream).

Step3: Go to the First Directory (Root) Address.

Step4: Extract index of first Block in file (starting Block).

Step5: Go to the Block Allocation Table (BAT) Address.

Step6: Loading Block Allocation Table (BAT).

Step7: Accessing from index of the first Block in file to all other Blocks.

Step8: If Block index = -1
- Calculate the Address of Block index in file.
- Record the Block as Unused Block.

Step9: Else, if (Not End of BAT) go to Step7.

Step10: End.



Search Unused Block Algorithm can be described as follows:

Step1

Loading header of document file.

Step2

Extracting from header offset and information about document file

metadata like size of block, Root ID, BAT ID, minimum size of a

stream.

Step3

After finding the block index of Root, its address in file can be

calculated by using equation (3.1)

sec_pos (SecID) = 512 + SecID · sec_size ……. (3.1)

And go to its Address.

Step4

Loading Root, and extracting from it Block index of first Block in file.

Step5

After finding BAT ID its Address can be calculated by using equation

(3.1) and go to its address.

Step6

Loading BAT.



Step7

Accessing from first block all other blocks in the file.

Step8

If Block index = -1, this is an Unused Block, so calculate its address using equation (3.1).

Step9

If Block index ≠ -1, go to Step7 to load the next block index and test it, until the end of the BAT.

Step10

End.
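The search can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: it reads only the first BAT block named in the header, it reports only blocks that physically exist in the file, and the function name and the tiny synthetic file are invented here.

```python
import struct

HEADER_SIZE = 512


def find_unused_blocks(data: bytes) -> list:
    """Return (block index, file offset) for every BAT entry equal to -1."""
    ssz, = struct.unpack_from("<H", data, 0x1E)        # block size exponent
    block_size = 1 << ssz
    bat_id, = struct.unpack_from("<i", data, 0x4C)     # first entry of the BAT array
    bat_offset = HEADER_SIZE + bat_id * block_size     # equation (3.1)
    bat = struct.unpack_from("<%di" % (block_size // 4), data, bat_offset)
    total_blocks = (len(data) - HEADER_SIZE) // block_size
    return [(index, HEADER_SIZE + index * block_size)
            for index, value in enumerate(bat[:total_blocks]) if value == -1]


# Synthetic file: block 0 = data, block 1 = the BAT itself, block 2 = unused, block 3 = data.
data = bytearray(HEADER_SIZE + 4 * 512)
struct.pack_into("<H", data, 0x1E, 9)
struct.pack_into("<i", data, 0x4C, 1)
entries = [-2, -3, -1, -2] + [-1] * 124
struct.pack_into("<128i", data, HEADER_SIZE + 1 * 512, *entries)

print(find_unused_blocks(bytes(data)))  # [(2, 1536)]
```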



[Flowchart: Loading Header of MCDFF file → Extracting information & Offset from Header → Go to Root Address → Extract index of first Block in file → Go to BAT Address → Loading BAT → Accessing from first BlockID to other Blocks in file → decision "If BlockID = -1" (Yes / No) → Calculate BlockID Address → Record Unused Block Address → End]

Figure (4.5) Search Unused Block Algorithm Flowchart



(4.3) Algorithm for Extracting Hidden Data

Input: Stegodocument.

Output: Hidden data.

Step1: Open Stegodocument.

Step2: Search for Unused Block in Stegodocument.

Step3: Extract Secret Message from Unused Block of Stegodocument.

Step4: Decode Secret Message.

Step5: End.


Extracting Algorithm can be described as follows:

Step1

Open Stegodocument (Document file + hidden data).

Step2

This step is assigned for calling Search Unused Block Algorithm for

finding Unused Block location.

Step3

Extracting Secret message from unused block.

Step4

Decode binary secret message.

Step5

End.
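Step 3 can be sketched in Python as follows. The thesis does not state how the length of the hidden bit string is stored, so the layout used here (a 2-byte little-endian bit count followed by the packed bits) is purely an assumption for illustration, as are the function name and the sample buffer.

```python
import struct


def extract_bits(document: bytes, block_index: int,
                 header_size: int = 512, block_size: int = 512) -> str:
    """Read the hidden bit string back from a block (hypothetical layout)."""
    offset = header_size + block_index * block_size   # equation (3.1)
    n_bits, = struct.unpack_from("<H", document, offset)
    n_bytes = (n_bits + 7) // 8
    raw = document[offset + 2:offset + 2 + n_bytes]
    bits = "".join(format(byte, "08b") for byte in raw)
    return bits[:n_bits]   # drop the padding bits of the last byte


# Sample buffer: block 1 holds a 12-bit message packed into two bytes.
stego = bytearray(512 + 2 * 512)
stego[1024:1028] = struct.pack("<H", 12) + bytes([0x28, 0x60])

print(extract_bits(bytes(stego), 1))  # 001010000110
```

The extracted bit string is then passed to the Huffman decoder in Step 4.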


[Flowchart: Start → Open Stegodocument → Search for Unused Block → Extract Secret Message from Unused Block → Decode Secret Message → End]

Figure (4.6) Extracting Algorithm Flowchart

Chapter Five

Experimental Results and Discussion

5.1 Introduction

In this chapter, the implementation of the proposed system is explained. The proposed system is built using Microsoft Visual C# .NET 2003 under Windows XP as the operating system, with Microsoft Word 2003 documents and the Office Automation technique provided by Microsoft.

To hide a secret message in an Unused Block, the Microsoft Office Word 2003 Binary File Format Specification is needed. Because file format developers view their specification documents as trade secrets and therefore do not release them to the public, the work started with the Automation technique provided by Microsoft (see section 3.13, Office Automation).

This technique is also used, instead of the Microsoft Office Word Binary File Format Specification, in the IEEE research published in March 2007, "New Steganographic method for data hiding in Microsoft Word Documents by a Change Tracking Technique".

To work with Word data and to let Word exchange data with other applications, the Automation technique allows data to be returned, edited, and exported by referencing another application's objects, properties, and methods.

Accessing Word components from C# is not quite as straightforward as many other features of C# and the .NET Framework; it is simply necessary to know what to reference and how to use the components.



Microsoft announced that its two core strategic technologies were the Win32 API and the Component Object Model (COM). The Win32 API is supported on all Windows operating systems, including the Windows 9x family (Microsoft Windows 95, Windows 98, and Windows Millennium Edition) and the Windows NT family (Windows 2000 and Windows XP).

When the .NET Framework was first introduced by Microsoft, the concept of managed code and unmanaged code as two different programming models was introduced as well. Microsoft defines managed code as code that is generated for the .NET Framework and can be executed by the common language runtime (CLR).

The common language runtime manages memory and validates code to make sure it does not attempt to perform illegal operations, such as accessing memory that does not belong to it. The runtime provides access to the Microsoft .NET Framework and the Base Class Libraries.

Unmanaged code, on the other hand, is any code that does not match the previous definition. As a result, all code created before the .NET Framework was released is considered unmanaged code. Unmanaged code includes the Win32 APIs, valuable external libraries, COM components, and COM+ services, all of which are useful and important.

The dilemma is that the .NET Framework, the development environment used in this work, only accepts managed code, whereas the valuable libraries and components that already exist are all unmanaged code. There is therefore a need to use this unmanaged code while working under the .NET Framework environment.



To solve this problem, and for backward compatibility, Microsoft introduced another concept called "Interoperating with unmanaged code", which covers how to call or use unmanaged code from within managed code and vice versa. This process is divided into two categories:

I. How to call Win32 APIs and DLLs (unmanaged code) from within the .NET Framework (managed code) using the Platform Invoke (PInvoke) technique.

II. How to use COM components (unmanaged code) from within the .NET Framework (managed code) using the COM Interop technique.

The work started with the second category.

To work with the COM objects exposed by the Office 2003 applications, Microsoft created a set of primary interop assemblies (PIAs). A primary interop assembly allows managed Visual C# .NET code to communicate with the host application's COM-based object model; Visual Studio Tools for the Microsoft Office System uses the PIAs.

To reference the correct assemblies for Word, note that the names of the assemblies vary with the installed version of Word. In this case, the PIAs are those provided with Visual Studio .NET 2003 and included in the Office 2003 family of products (see Table (3.7)). It should be noted that the Word reference could not be added when working with Visual Studio 2005 and Microsoft Word 2003.

The following software must therefore be installed:

Microsoft .NET Framework 1.1.

Microsoft Office Word 2003, including the necessary Primary Interop Assembly.

Visual Studio 2003.

To reference the Word assemblies, follow these steps:

I. On the Project menu, click Add Reference.

II. On the COM tab, locate Microsoft Word Object Library, and then click Select.

III. Click OK in the Add References dialog box to accept your selections, as shown in Figure (5.1).

Figure (5.1) Word Reference

After the references are set up, the Word components can be used. However, these components are a little tricky to deal with and can act in unexpected ways. These objects work by creating an instance of Word under the current session and giving access to Word’s functionality.

In order to automate an application, the object model employed by the target application exporting activation objects must be known. This requires that the developer of the target application publicly document its object model. The development of automation controllers without knowledge of the target application's object model is "difficult to impossible". Because of these complications, Automation components are usually provided with type libraries, which contain metadata about the classes, interfaces and other features exposed by an object library.

Microsoft has publicly documented the object model of all of the

applications in Microsoft Office, and some other software developers have

also documented the object models of their applications. Object models are

presented to automation controllers as type libraries.

The results of working with the Word objects are listed below (for full details about the Word objects, see section 3.15, Word Object Model): modifying the text format in a document, counting the characters in a document, modifying the table format in a document, hiding text in a document, and many other operations.

These operations cannot access the Binary File Format, but they make it possible to call Microsoft Word from Visual C#. Unlike the customary approach, this new method therefore does not need a purpose-built text editor for loading the .doc file.

To access the Binary File Format, its specification had to be obtained from its source, the Microsoft Corporation in the United States. Many authorities and many web sites belonging to the Microsoft Developer Network (MSDN) were contacted in an attempt to obtain it. Microsoft required a legal agreement before supplying the Binary File Format; the agreement was sent to the company by fax, and the company supplied the Microsoft Office Word 97-2007 Binary File Format Specification. The format turned out to be very complex and, since it was a non-public format, it was supported by only a few programs. To access it programmatically, the first category must be used: the PInvoke technique (see section 3.16, Platform Invoke).



The path for accessing the Unused Block is shown in Figure (5.2). At this point, it was not possible to access the Compound Header of the .doc file without entering from the Root using APIs (see section 3.17, Application Programming Interfaces, and section 3.18, Office Application Programming Interfaces). The following functions can be used:

I. Structured Storage API

The StgOpenStorageEx function opens an existing root object in compound files.

Note: all Windows 2000, Windows XP, and Windows Server 2003 applications should call StgOpenStorageEx instead of StgOpenStorage. The StgOpenStorage function is used for compatibility with Windows 2000 and earlier applications.

II. COM provides two interfaces for accessing a compound file: IStorage and IStream.

The IStorage interface provides methods that can be performed on a storage. The IStream interface is used to read and write data to a stream.

[Block diagram with the elements: Structured Storage API, Root, IStorage interface, IStream interface, Header, Finding BAT, Offset BAT, Main Stream, Table Stream, Object pool]

Figure (5.2) Block Diagram for Unused Block Path in document File



5.2 System Implementation

In this section, the stages of the system are discussed; these stages are shown in Figure (5.3).

Figure (5.3) The main menu of the proposed system.

5.2.1 Document before Hiding

After the cover document file is opened, the Track Changes tool is used to modify the document so that it resembles collaborative writing between many authors. The cover document is shown in Figure (5.4):

Figure (5.4) Cover Document before Track Change



When the "Document before Hiding" button is clicked, it opens the window shown in Figure (5.5):

Figure (5.5) Cover Document after Track change

5.2.2 Embedding Process

This stage is the primary stage in the proposed system. It describes the implementation of the embedding method in the following steps:

The first step is to read the compound header (the FIB has a fixed length of 1472 bytes). The first bytes of the cover file are:

00000000:D0 CF 11 E0 A1 B1 1A E1 00 00 0000000000 00 00000010:00 00 00 00 00 00 00 00 3E 0003 00 FE FF09 00 00000020:06 00 00 00 00 00 0000 00 00 00 00 010000 00 00000030:3F 00 00 00 00 00 00 00 00 10 0000 4100 00 00 00000040:01 00 0000 FE FF FF FF 00 00 00 00 3E 00 00 00 00000000H D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00

1) 8 bytes containing the fixed compound document file identifier (magic number).

00000010H 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00

2) 2 bytes containing the byte order identifier. It should always consist of the byte sequence FEH FFH.

91


00000010H 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00

00000020H 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00

3) 2 bytes containing the size of sectors (small Block) or size of Block (big block) the size is 512 bytes, 2 bytes containing the size of short-sectors or size of small Block size is 64 bytes here.

00000020H 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00

4) 4 bytes containing the number of sectors used by the sector allocation table or Number of elements in the BAT uses only one sector or Block here.

00000030H 3f 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00

5) 4 bytes containing the SecID of the first sector used by the directory or Block index of the first block of the property table. It starts at sector or Block 63 here

00000030H 3f 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00

6) 4 bytes containing the minimum size of standard streams. The bytes 00 10 00 00 are little-endian for 00001000H = 4096 bytes here. A file of at least this size is stored in big blocks, and the main BAT is used to walk the big blocks making up the file.

00000040H 01 00 00 00 FE FF FF FF 00 00 00 00 3e 00 00 00

7) 4 bytes containing the block index of the BAT. It starts at block 62 here.

The second step: find the starting block of the file, as specified by its property (directory) entry. Each entry is 128 bytes long:

00008000: 52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00
00008010: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00008020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008040: 16 00 05 01 FF FF FF FF FF FF FF FF 03 00 00 00
00008050: 06 09 02 00 00 00 00 00 C0 00 00 00 00 00 00 46
00008060: 00 00 00 00 00 00 00 00 00 00 00 00 E0 12 2F 4E
00008070: B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00



00008000 52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00
00008010 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00008020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00008030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

1) 64 bytes containing the character array of the entry name (16-bit characters, terminated by the first <00> character). The name of this entry is "Root Entry" here.

00008070 B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00

2) 4 bytes containing the starting block of the file (42H = 66 here). This is the first block of the file; the pointer to each following block is read from the BAT.
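The fields read in the first two steps sit at fixed offsets, so they can be decoded with a few little-endian reads. The following Python sketch is illustrative only and is not the thesis implementation: the names parse_header, parse_dir_entry and block_offset and the dictionary keys are hypothetical, and the offsets assume the compound document layout documented in [DAN07]. The input bytes are copied from the two dumps above.

```python
import struct

# Bytes copied from the dumps above: start of the header, then the Root Entry.
HEADER = bytes.fromhex(
    "D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00 "
    "06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 "
    "3F 00 00 00 00 00 00 00 00 10 00 00 41 00 00 00 "
    "01 00 00 00 FE FF FF FF 00 00 00 00 3E 00 00 00"
)
ROOT_ENTRY = bytes.fromhex(
    "52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00 "
    "72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "16 00 05 01 FF FF FF FF FF FF FF FF 03 00 00 00 "
    "06 09 02 00 00 00 00 00 C0 00 00 00 00 00 00 46 "
    "00 00 00 00 00 00 00 00 00 00 00 00 E0 12 2F 4E "
    "B8 48 C9 01 42 00 00 00 80 00 00 00 00 00 00 00"
)
MAGIC = bytes.fromhex("D0 CF 11 E0 A1 B1 1A E1")


def parse_header(h):
    """Decode the header fields listed in items 1 to 7 (all little-endian)."""
    if h[:8] != MAGIC:
        raise ValueError("not a compound document file")
    if h[28:30] != b"\xfe\xff":
        raise ValueError("unexpected byte order identifier")
    big_shift, small_shift = struct.unpack_from("<HH", h, 30)
    bat_blocks, dir_start = struct.unpack_from("<Ii", h, 44)
    (min_stream,) = struct.unpack_from("<I", h, 56)
    (first_bat,) = struct.unpack_from("<i", h, 76)
    return {
        "big_block_size": 1 << big_shift,      # sizes are stored as exponents
        "small_block_size": 1 << small_shift,
        "bat_blocks": bat_blocks,
        "dir_start": dir_start,
        "min_stream_size": min_stream,
        "first_bat_block": first_bat,
    }


def parse_dir_entry(e):
    """Decode name, starting block and size from one 128-byte entry."""
    (name_len,) = struct.unpack_from("<H", e, 64)  # bytes, incl. terminator
    name = e[:max(name_len - 2, 0)].decode("utf-16-le")
    start, size = struct.unpack_from("<iI", e, 116)
    return name, start, size


def block_offset(block, block_size=512):
    """File offset of a big block; the 512-byte header precedes block 0."""
    return 512 + block * block_size


info = parse_header(HEADER)
entry = parse_dir_entry(ROOT_ENTRY)
```

For the bytes shown, this gives block sizes of 512 and 64 bytes, a directory starting at block 63 (file offset 8000H, which is where the second dump begins), a BAT starting at block 62, and a directory entry named "Root Entry" with starting block 66.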

The third step: load the BAT array to locate the unused blocks, in which the secret message will be hidden. For this cover, the Block Allocation Table is:

Block index:  0  1  2  3  4  5  6  7  8  9  10  11  12  …  …
Next block:   1  2  3  4  5  6  7  8  9  10  11  12  …  -1  …

An entry of -1 marks an unused block.
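A minimal Python sketch of this step follows, assuming the BAT is stored as signed 32-bit little-endian entries in which -1 marks an unused block and -2 ends a chain (the convention described in [DAN07]). The function names and the eight-entry table are hypothetical; the table is not the real BAT of the cover file.

```python
import struct

FREE = -1           # unused block
END_OF_CHAIN = -2   # last block of a file


def load_bat(raw):
    """Interpret raw BAT bytes as signed 32-bit little-endian entries."""
    count = len(raw) // 4
    return list(struct.unpack("<%di" % count, raw[:count * 4]))


def follow_chain(bat, start):
    """Return the blocks that make up the file starting at block `start`."""
    chain, block = [], start
    while block >= 0:
        if block in chain:
            raise ValueError("BAT chain loops back on itself")
        chain.append(block)
        block = bat[block]
    return chain


def unused_blocks(bat):
    """Indices of all blocks marked as unused."""
    return [i for i, nxt in enumerate(bat) if nxt == FREE]


# Hypothetical table: one file in blocks 0-3, one in blocks 5-6,
# and two unused blocks (4 and 7).
raw = struct.pack("<8i", 1, 2, 3, END_OF_CHAIN, FREE, 6, END_OF_CHAIN, FREE)
bat = load_bat(raw)
```

With this toy table, follow_chain(bat, 0) walks blocks 0 to 3 and unused_blocks(bat) reports blocks 4 and 7 as candidates for hiding data.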

The secret message will be:

THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT

TEN PM FROM CELING OUR TARGET DIAMOND

The fourth step: Encoding the secret message with Huffman Coding:

0010 10000 000 0110 000 0011 0110 000 000 10011 000 11101 000 0101 11100 10101 0011 0110 10010 10001 0100 10101 0010 0010 11011 000 0101 0010 11001 0111 0101 10100 0100 10101 0101 0010 000 0110 0011 0010 0010 000 0101 11000 10110 10111 0110 0100 10110 10100 000 0111 10011 0111 0101 11100 0100 10101 0110 0010 0011 0110 11100 000 0010 10010 0111 0011 10110 0100 0101 10010
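For readers unfamiliar with Huffman coding, the sketch below builds a prefix code from the character frequencies of the message itself and round-trips the message through it. It is an illustration under stated assumptions, not the code table used above: the thesis does not list its table, so the code words produced here may differ from those shown, and the names build_codes, encode and decode are hypothetical.

```python
import heapq
from collections import Counter


def build_codes(text):
    """Build a Huffman code table {symbol: bit string} for `text`."""
    heap = [(freq, i, sym)
            for i, (sym, freq) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol message
        return {heap[0][2]: "0"}
    next_id = len(heap)                     # unique tie-breaker for the heap
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:               # prefix code: first match is right
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)


MESSAGE = ("THERE ARE ELEVEN GUARDS OUT TWENTY IN COUNTER AT "
           "TEN PM FROM CELING OUR TARGET DIAMOND")
codes = build_codes(MESSAGE)
bits = encode(MESSAGE, codes)
```

Because frequent symbols receive shorter code words, the encoded bit string is shorter than the same message stored at 8 bits per character.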

The fifth step: hide the encoded secret message in the unused block.
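The write itself can be sketched as follows, working on an in-memory copy of the file. The payload layout (a 2-byte little-endian bit count followed by the packed bits) is an assumption made for this illustration, since the thesis does not specify how the bit string is laid out inside the block; the names embed and bits_to_bytes are likewise hypothetical.

```python
import struct

HEADER_SIZE = 512  # the compound document header precedes block 0


def bits_to_bytes(bits):
    """Pack a '0'/'1' string into bytes, zero-padding the last byte."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def embed(image, block, bits, block_size=512):
    """Write `bits` into one unused block of `image` (a bytearray) in place.

    Assumed layout, not taken from the thesis: a 2-byte little-endian
    bit count, then the packed bits.
    """
    if len(bits) > (block_size - 2) * 8:
        raise ValueError("message does not fit in one block")
    start = HEADER_SIZE + block * block_size
    if start + block_size > len(image):
        raise ValueError("block lies outside the file image")
    payload = struct.pack("<H", len(bits)) + bits_to_bytes(bits)
    image[start:start + len(payload)] = payload
    return start


# Toy file image: header plus four empty 512-byte blocks.
image = bytearray(HEADER_SIZE + 4 * 512)
# First four code words of the encoded message above, written to block 2.
offset = embed(image, 2, "0010" "10000" "000" "0110")
```

Only bytes inside the chosen block change, so the length of the file image stays the same in this sketch.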



When the Embedding Process button is clicked, it opens the window shown in Figure (5.6):

Figure (5.6) the Embedding Process Window

After the secret message is written, the Embed button must be pressed to hide it.

The Exit button closes the current form.

5.2.3 Document after Hiding

This button opens the document after a secret message has been hidden, as shown in Figure (5.7):

Figure (5.7) Document after Hiding



The message shows that the document contains Track Changes information and asks whether to continue saving this information.

5.2.4 Extracting Process

This stage describes the implementation of the extracting method in the following steps:

The first step: read the compound header, as in the embedding process.

The second step: find the starting block of the file.

The third step: load the BAT array to locate the unused block and extract the secret message from it.

The fourth step: decode the secret message with Huffman coding.
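The last two extraction steps can be sketched in Python as below. This is an illustration only: it assumes the payload starts with a 2-byte little-endian bit count followed by the packed bits (a layout chosen for the sketch, not stated in the thesis), and the four-entry code table is hypothetical rather than the table used by the proposed system.

```python
import struct

HEADER_SIZE = 512  # the compound document header precedes block 0


def extract_bits(image, block, block_size=512):
    """Read a length-prefixed bit string back out of one block."""
    start = HEADER_SIZE + block * block_size
    (nbits,) = struct.unpack_from("<H", image, start)
    nbytes = (nbits + 7) // 8
    data = image[start + 2:start + 2 + nbytes]
    bits = "".join(format(b, "08b") for b in data)
    return bits[:nbits]                     # drop the zero padding


def huffman_decode(bits, codes):
    """Decode a bit string with a prefix code table {symbol: bit string}."""
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)


# Hypothetical prefix code table for four letters.
CODES = {"T": "0010", "H": "10000", "E": "000", "R": "0110"}

# Toy file image whose block 1 already holds a 19-bit payload.
image = bytearray(HEADER_SIZE + 2 * 512)
image[1024:1029] = bytes([0x13, 0x00, 0x28, 0x06, 0x00])

bits = extract_bits(image, 1)
message = huffman_decode(bits, CODES)
```

The receiver needs the same code table as the sender; how the table is shared is outside the scope of this sketch.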

When the Extraction Process button is clicked, it opens the window shown in Figure (5.8):

Figure (5.8) Extracting Process Window

To extract the hidden data, press the Extract button.

The Exit button closes the current form.



5.3 Comparison between the Proposed System and the Most Popular Text Hiding Methods

This work differs from other text hiding methods in the following ways:

Table (5.1) Comparison between the Proposed System and Other Text Hiding Methods

1. The proposed system: no difference in the apparent text can be found between the cover document and the stego-document.
   Text hiding systems: a difference in the apparent text between the cover and the stego-document can be found in hiding methods such as inter-line and inter-word spacing.

2. The proposed system: the hidden data is not related to the cover text; it can be English or Arabic text.
   Text hiding systems: the hidden data may be related to the cover text.

3. The proposed system: no problem with the hidden data was detected when the stego-document was mailed or copied.
   Text hiding systems: some programs, such as "sendmail", may inadvertently remove the extra space characters that carry the hidden data.

4. The proposed system: requires access to the binary file format, which describes exactly how the data is encoded and how to reach the unused blocks used for hiding data.
   Text hiding systems: do not need knowledge of the binary file format.

5. The proposed system: using the Track Changes tool does not affect the hidden data.
   Text hiding systems: this tool has not yet been used in related work.

6. The proposed system: cannot be detected by software that detects changes in character features.
   Text hiding systems: can be detected by such software.

7. The proposed system: in this work, cover size = 34 KB and hidden size = 63 bytes; for reference, the size of an empty document is 10 to 11 KB.
   Text hiding systems: taking the Open Space method as an example, inter-sentence spacing requires a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text.
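The two capacity figures in the last row can be put on the same scale with a short calculation. The Python sketch below is only a back-of-the-envelope comparison: it assumes 1 KB = 1024 bytes and that the 63-byte hidden size is the payload actually stored in the 34 KB cover, neither of which is stated explicitly in the text, and the function name is hypothetical.

```python
def cover_bytes_per_hidden_bit(cover_bytes, hidden_bytes):
    """How many bytes of cover are spent per hidden bit (lower is better)."""
    return cover_bytes / (hidden_bytes * 8)


# Proposed system: 63 bytes hidden in a 34 KB cover (1 KB assumed = 1024 B).
proposed = cover_bytes_per_hidden_bit(34 * 1024, 63)

# Inter-sentence spacing: one bit per sentence of two 80-character lines.
inter_sentence = 2 * 80

ratio = inter_sentence / proposed
print(f"proposed: one bit per {proposed:.1f} bytes; "
      f"inter-sentence spacing: one bit per {inter_sentence} bytes")
```

Under these assumptions the example cover carries roughly one hidden bit per 69 bytes, a little more than twice the density quoted for inter-sentence spacing. The figure describes this one example only; the true capacity depends on how many unused blocks a given cover contains.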


Chapter Six

Conclusions and Suggestions for Future Work

6.1 Conclusions

The proposed system provides a new method for embedding text in text. A number of conclusions were derived from this study:

I. The cover generation process in the hiding system increases the security of the system and avoids arousing suspicion that hidden data is present.

II. Hidden data in the document is not affected by copying or mailing the stego-document.

III. The proposed system hides English text in another text and gives good results.

IV. This method of hiding data in MCDFF is only one of many ways to hide or encrypt data.

V. The difference between the original cover-text size and the cover-text size after the embedding process is acceptable. For example, in this case:

Original cover-text size: 34 KB
Cover-text size after embedding: 34.5 KB
Hidden message size: 63 bytes
Size of an empty document: 10 to 11 KB


6.2 Suggestions for Future Work

With the rapid development of science and technology, information will be hidden in increasingly creative ways. Many new methods and technologies will emerge, leaving more room for the development of information hiding technology.

Several suggestions can be made for future work:

I. The system could be modified to work with other Microsoft Office 2003 file types, such as Microsoft PowerPoint (.ppt) and Microsoft Excel (.xls).

II. An encryption process could be applied before the embedding process; this would increase the security of the system.

III. Within MCDFF, another digital warren, such as slack space, could be used.

IV. A compression process could be applied after the encoding process to compress the Huffman-coded output.

V. Secret-key steganography could be used instead of pure steganography to implement the proposed system.


References

[ABD01]

Abdul Wahab, H., B.,"Information Hiding in written Text Using Context

Free Grammar (CFG) ", Msc. Thesis, University of Technology,

Department of Computer Science and Information System, Baghdad, 2001.

[ACK07]

Ackley, S., R.," Word File Format ", Apache POI – HWPF – Java API to

Handle Microsoft Word Files, 2007.

[ALD05]

Al-Dhao, T., A. and Rahma, S., A., "Analysis of Information Hiding

Techniques in the Text", Engineering & Technology Journal Vol. 24, No.

6, 2005.

[ALS01]

Al-Shamkhy, R., A.," Hiding Text in Text Using Dictionary Method"

MSc. Thesis, Department of Computer Science and Information System,

Baghdad, 2001.

[BAK05]

Baker, E., J., "Image Watermarking Using Coarseness and Wavelet

Transform", Msc. Thesis, University of Technology, Information Institute

for Postgraduate Studies, 2005.

[BER06]

Berghel, H., Hoelzer, D., and Sthultz, M., "Data Hiding Tactics for

Windows and Unix File System",

http://www.berghel.net/publications/data_hiding.php, May 26, 2006.

[BER05]

Bergman C. and Davidson J.," Unitary Embedding for Data Hiding with

the SVD", Security, Steganography, and Watermarking of Multimedia

Contents VII, SPIE Vol. 5681,San Jose, CA, Jan. 2005,URL:

http://orion.math.iastate.edu/cliff/manuscripts/svdstego.pdf

[CAC98]

Cacciaguerra, S., and Ferretti, S., "DATA HIDING: STEGANOGRAPHY AND COPYRIGHT MARKING", http://www.cs.unibo.it/~scacciag/home_files/teach/datahiding.pdf, 1998.

[CHA00]

Chand, M., "Structure Storage: A COM way to read/write persistent data", http://www.dotnetheaven.com/Uploadfile/mahesh/_com104252005081250AM/_com1.aspx?ArticleID=307eca4f-723b-4ed5-b823-2a05e71ai402, June 26, 2000.

[CUM04]

Cummins J., Diskin P., Lau S. and Parlett R., "Steganography And

Digital Watermarking ", School of Computer Science, The University of

Birmingham , 2004.

[DAN07]

Daniel, R., "OpenOffice.org's Documentation of the Microsoft Compound Document", OpenOffice.org, the Spreadsheet Project, June 2007.

[DIA08]

Dialogika, Makz, Math, Wk and Divo, "How to Retrieve Text from a

Binary .doc File", March 2008.

[DIC07]

Dickman, D., S., "An Overview of Steganography ", July 2007.

[DOB97]

Dr. Dobb's Journal, "Steganography for DOS Programmers", January 1997.

[DUC01]

Ducan, S., "An Introduction to Steganography", Internet Surveys, 2001.

[DUN02]

Dunbar, B., "A detailed look at Steganographic Techniques and their use

in an Open-Systems Environment", SANS Institute, 2002.

[ETT98]

Ettinger, J., M., "Steganalysis and Game Equilibria", Information Hiding: Second International Workshop, Proceedings, Vol. 1525 of Lecture Notes in Computer Science, Springer, 1998, pp. 319-328.

[FOL98]

Folk, M., J., Zoellick, B. and Riccardi, G., "File Structures: An Object-Oriented Approach with C++", Addison-Wesley, 1998.

[GRA01]

Granor, T., E., "Session FT-Automating Microsoft Word", Automating

Microsoft Word Fox Teach 2001, page 38, 2001.

[HYU08]

Hyukdon K., Yeog K. and Sangjin L., "A Tool for Detection of Hidden

Data in Microsoft Compound Document File Format" , 2008

International Conference on Information Science and Security © 2008

IEEE , 2008.

[JAJ98]

Jajodia, S., and Johnson, N., F., "Steganalysis of Images Created Using Current Steganography Software", Information Hiding: Second International Workshop, Proceedings, Vol. 1525 of Lecture Notes in Computer Science, Springer, 1998, pp. 273-289.

[JIT06]

Jithra, K., "Microsoft Office Security, Part one",

http://www.securityfocus.com/infocus/1874, 2006-08-22.

[JOH01]

Johnson N. F., Duric Z. and Jajodia S., "Information Hiding: Steganography and Watermarking - Attacks and Countermeasures", Kluwer Academic Publishers, USA, 2001.

[JOH98]

Johnson, F. and Jajodia, S., "Steganalysis: The Investigation of Hidden

Information," in Proc. IEEE Information Technology Conf., Syracuse,

NY, Sep.1998, pp.113-116.

[JOH99]

Johnson, N.F., "Steganography ", an Internet Survey, 1999.

[KAH96]

Kahn D., "The History of Steganography", Information Hiding: First International Workshop, Proceedings, Vol. 1174 of Lecture Notes in Computer Science, Springer, 1996, pp. 4-5.

[KAT00]

Katzenbeisser S. and Petitcolas F., "Information Hiding Techniques for Steganography and Digital Watermarking", Artech House Inc, USA, 2000.

[KHO05]

Khor, S., M. and Leonard, A., "Installing and Using the Office 2003

Primary Interop Assemblies", Microsoft Corporation, January 2005.

[KRE04]

Krenn, R., "Steganography: Implementation & Detection", found online at http://www.krenn.nl/univ/cry/steg/presentation/2004-01-21-presentation-steganography.pdf, 2004.

[KUO70]

Kuo, F., F., "An introduction to error-correcting codes", 1970, pp. 225-231.

[LIU07]

Liu T.-Y. and Tsai W.-H., "A New Steganographic Method for Data

Hiding in Microsoft Word Documents by a Change Tracking Technique

", IEEE Transactions on Information Forensics And Security, Vol. 2, No. 1,

March 2007.

[MAR07]

Marc, J., "POIFS File System Internals", the Apache POI Project, the

Apache Software Foundation, 2007.

[MCC99]

McCarty, B., "Learning Debian GNU/Linux", O'Reilly Online Catalog, Chapter two, September 1999.

[MIC07]

Microsoft Open Specification Promise, "Microsoft Office Word 97-2007

Binary File Format (.doc) Specification", © 2007 Microsoft Corporation.

[MIC99]

Microsoft Corp., "OLE Concepts and Requirements Overview", October 1999.

[MIK07]

Mikhail, R., M., " Information Hiding Using Petri Nets and Wavelet

Transform " , Msc. Thesis, University of Technology, Department of

Computer Science and Information System, Baghdad, 2007.

[MIN06]

Ming, C., Ru, Z., Xinxin, N., and Yixian, Y., "Analysis of Current

Steganography tools: Classifications & Features", International

Conference on Intelligent Information Hiding and Multimedia Signal

Processing, © 2006 IEEE.

[MOR00]

Morkel T., Eloff J., and Olivier M., "An overview of image

Steganography", Information and Computer Security Architecture (ICSA)

Research Group, Department of Computer Science University of Pretoria,

2000, URL:

http://icsa.cs.up.ac.za/issa/2005/Proceedings/Full/098_Article.pdf.

[RAN03]

Randall, B., A., "Visual Studio Tools for the Microsoft Office System",

MCW Technologies, LLC, April 2003.

[RIM97]

Rimell J., "Data Hiding Inside TIFF Images", John's College, Cambridge, England, 1997.

[ROC08]

Rocha, A. and Goldenstein, S., "Information Hiding: Types and Applications", IEEE WVU, Anchorgr-2008, 2008.

[ROM96]

Roman, S., "Introduction to Coding and Information Theory", 1996.

[SAL95]

Salomon, D., "Data Compression ", the complete reference, Springer,

PP.38-39, 1995.

[VIL06]

Villan, R., Voloshynovskiy, S., Koval, O.,Vila, J., Topak, E., Deguillaume,

F., Rytsar, Y., and Pun, T., "Text Data-Hiding for Digital and Printed

Documents: Theoretical and Practical Considerations", Computer Vision

and Multimedia Laboratory – University of Geneva, 2006.

[XIU06]

Xiuhui G., Renpu J., and Jiazhen W., "Research on Information Hiding ",

US-China Education Review, ISSN1548-6613, USA, Vol. 5, No. 3 (Serial

No. 18) May 2006.

[YOS06]

Yoshioka, K., Sonoda, K., and Takizawa, O., "Information Hiding on

Lossless Data Compression", International Conference on Intelligent

Information Hiding and Multimedia Signal Processing © 2006 IEEE, 2006.

Websites

[Web01]

"A Closer Look at Platform Invoke",

Website: http://msdn.microsoft.com/en-us/library/0h9e9t7d(vs.71).aspx.

[Web02]

"Bravo (software)",

Website: http://en.wikipedia.org/wiki/Bravo_(software).

[Web03]

"File Format",

Website: http://en.wikipedia.org/wiki/File_format.

[Web04]

"How does Track Changes in Microsoft Word Work?"

Website: http://www.shaunakelly.com/word/trackchanges/HowTrackChangesWork.html.

[Web05]

"How to extract information from Office files by using Office file formats

and schemas",

Website: http://Support.microsoft.Com/kb/840817/en-us.

[Web06]

"Huffman Coding",

Website: http://www.si.umich.edu/Classes/540/Readings/Encodings/Encoding%20-%20Huffman%20Coding.htm.

[Web07]

"Microsoft Office",

Website: http://en.wikipedia.org/wiki/Microsoft_Office.

[Web08]

"Microsoft Word",

Website: http://en.wikipedia.org/wiki/Microsoft_Office_Word.

[Web09]

"Microsoft Word 5.0 (PC) Binary File Format",

Website: http://www.msxnet.org/word2rtf/formats/dosword5.

[Web10]

"OLE Automation",

Website: http://en.wikipedia.org/wiki/OLE_Automation.

[Web11]

"Structure of a Word document",

Website: http://www.linguistics.ucsb.edu/facutly/cumming/WordForLinguists/Structure.htm.

[Web12]

"What Is API",

Website:

http://msdn.microsoft.com/en-us/library/aa141380(office.10).aspx.

Appendix A

Microsoft Word Document Cover before Tracks Change

Appendix B

Microsoft Word Document Cover after Tracks Change

Appendix C

(FIB)

Appendix C

Structure of File Information Block (FIB) [MIC07]

In Word version 8, the FIB is reorganized to make future extension easier, and to make it easier to make backward-compatible file format changes. The FIB now consists of four substructures: the header and three arrays. The FIB header is unchanged from past versions. The second part is an array of 16-bit "shorts", most of which were present in earlier versions in different locations. The third part is an array of 32-bit longs, many of which were scattered through the previous version FIB. Finally, there is an array of FC/LCB pairs, which were divided into several disjoint arrays in the previous FIB.

Future versions of Word will add entries to the three arrays, so readers of the FIB must be careful to skip over any entries in each array that were not present in the version for which the reader was designed. Writers of the FIB must write exactly as many entries as were defined for the nFib value they put in the FIB.

The FIBFCLCB structure, used in an array in the FIB:

Decimal | Hex    | Name | Type  | Bitfield size | Comments                                | Introduced
0       | 0x0000 | fc   | long  |               | File position where data begins.        |
4       | 0x0004 | lcb  | ulong |               | Size of data. Ignore fc if lcb is zero. |
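A pair of this shape can be read with a single fixed-width unpack. The Python sketch below is illustrative: it assumes little-endian storage with fc as a signed and lcb as an unsigned 32-bit value, as the long and ulong types in the table suggest, and the names FcLcb, read_fclcb and referenced_range as well as the sample values are hypothetical.

```python
import struct
from typing import NamedTuple


class FcLcb(NamedTuple):
    fc: int    # file position where the data begins
    lcb: int   # size of the data in bytes


def read_fclcb(buf, offset=0):
    """Read one 8-byte FC/LCB pair from `buf` at `offset`."""
    fc, lcb = struct.unpack_from("<iI", buf, offset)
    return FcLcb(fc, lcb)


def referenced_range(pair):
    """Return (start, end) file positions, or None when lcb is zero.

    The table says to ignore fc whenever lcb is zero, so an empty
    pair never points at any data.
    """
    if pair.lcb == 0:
        return None
    return pair.fc, pair.fc + pair.lcb


# Two hypothetical pairs packed back to back: one used, one empty.
sample = struct.pack("<iIiI", 0x1200, 0x40, 0x7FFF, 0)
first = read_fclcb(sample, 0)
second = read_fclcb(sample, 8)
```

Checking lcb before using fc mirrors the comment in the table and avoids following a stale file position left in an empty entry.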

The FCPGDOLD structure, referenced in the FIB, used internally by Word:

Decimal | Hex    | Name   | Type  | Bitfield size | Comments                                | Introduced
0       | 0x0000 | fcPgd  | long  |               | File position where data begins.        |
4       | 0x0004 | lcbPgd | ulong |               | Size of data. Ignore fc if lcb is zero. |
8       | 0x0008 | fcBkd  | long  |               | File position where data begins.        |
12      | 0x000C | lcbBkd | ulong |               | Size of data. Ignore fc if lcb is zero. |

The FCPGD structure, referenced in the FIB, used internally by Word. This modified version of the above structure was introduced in Word 2003:

Decimal | Hex    | Name   | Type  | Bitfield size | Comments                                | Introduced
0       | 0x0000 | fcPgd  | long  |               | File position where data begins.        | Word 2003
4       | 0x0004 | lcbPgd | ulong |               | Size of data. Ignore fc if lcb is zero. | Word 2003
8       | 0x0008 | fcBkd  | long  |               | File position where data begins.        | Word 2003
12      | 0x000C | lcbBkd | ulong |               | Size of data. Ignore fc if lcb is zero. | Word 2003
16      | 0x0010 | fcAfd  | FC    |               | File position where data begins.        | Word 2003

Abstract

Security is a need of the individual, of society and of the world. Security is not the responsibility or privilege of guards or security agents alone; it is the concern of every person, just as keeping a door closed is the responsibility of everyone who passes through that door, regardless of height, colour, class or station in life.

Data hiding research has become the heart of data security research, because web sites and network communications all rely on video, audio, images and so on.

Data hiding techniques can conceal secret information in a digital medium without degrading the perceptual quality of that medium, so that other people cannot perceive that the medium carries secret information.

This thesis proposes a steganographic method that exploits the physical properties of the computer system and the way it stores and handles a (.doc) file as a compound file, using the unused (empty) blocks in the complex structure of a Microsoft Word file to hide data. The facilities provided by Microsoft Word, such as its tools, are also exploited to generate the cover.

The proposed system hides a text in another text by steganography using two processes: the cover generation process and the embedding process.

Cover generation process: since the cover is a Microsoft Word 2003 document, it will appear to be the product of collaborative writing among several authors.

Embedding process: hides a text string in the unused (empty) blocks of the binary structure of the file.

This thesis presents a hiding system that uses Word, one of the Microsoft Office applications. After reviewing the other applications, and based on the latest published research, we found that it has fewer weaknesses than the others.

The system was implemented in C# 2003 under the Windows XP operating system on a Pentium 4 laptop with a 2.00 GHz processor and gigabyte-scale memory.

Republic of Iraq

Ministry of Higher Education and Scientific Research

University of Technology

Department of Computer Sciences

Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique

A thesis submitted to the Department of Computer Sciences at the University of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Sciences

Submitted by

אאאא

Supervised by

Ù]Ù]ددfÂ<…çjÒfÂ<…çjÒ{{ددß¹]<ß¹]<{{‘<ÜÃ‘<ÜÃ{{<çe]<^<çe]<^<éf<éf<<
