Page 1: Batch Steganography in the Real World (agents.fel.cvut.cz/stegodata/pdfs/Pev12-ACM.pdf)

Batch Steganography in the Real World

Andrew D. Ker
Department of Computer Science

University of Oxford
Parks Road

Oxford OX1 3QD, UK

[email protected]

Tomáš Pevný
Agent Technology Center

Czech Technical University in Prague
Karlovo namesti 13

121 35 Prague 2, Czech Republic

[email protected]

ABSTRACT

We examine the universal pooled steganalyzer of [15] in two respects. First, we confirm that the method is applicable to a number of different steganographic embedding methods. Second, we consider the converse problem of how to spread payload between multiple covers, by testing different payload allocation strategies against the universal steganalyzer. We focus on practical options which can be implemented without new software or expert knowledge, and we test on real-world data. Concentration of payload into the minimal number of covers is consistently the least detectable option. We present additional investigations which explain this phenomenon, uncovering a nonlinear relationship between embedding distortion and payload. We conjecture that this is an unavoidable consequence of blind steganalysis. This is significant for both batch steganography and pooled steganalysis.

Categories and Subject Descriptors

D.2.11 [Software Engineering]: Software Architectures—information hiding

General Terms

Security, Algorithms

Keywords

Batch Steganography, Anomaly Detection, Steganalysis, Embedding Strategies

1. INTRODUCTION

Steganography and steganalysis have been mainly concerned with simple abstractions of the data hiding problem, focusing on embedding of payload in one cover, or detection of payload in one object belonging to one user. The problems of batch steganography – how best to spread payload between multiple covers – and pooled steganalysis – how to pool evidence from multiple objects of suspicion – were posed in 2006 [11]. At the time it was proposed that an essentially game-theoretic situation would occur, with the optimal choice of payload spreading depending on the opponent's method for amalgamating evidence, and vice versa. Until recently there was little published research on these topics, but early analyses [11, 12] indicated that, for the embedder, the optimal behaviour is likely to be extreme concentration of payload into as few covers as possible, or the opposite, in which payload is spread as thinly as possible. However, these theoretical results could not be confirmed without practical pooled steganalyzers to test against.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM&Sec'12, September 6–7, 2012, Coventry, United Kingdom.
Copyright 2012 ACM 978-1-4503-1418-3/12/09 ...$10.00.

In 2011-12 [14, 15] we demonstrated a method for pooled steganalysis which treats actors (users who transmit cover or stego objects) as the unit of classification, measures distance between actors as a distributional difference between the feature clouds of their transmitted objects, and then applies outlier analysis to detect guilty (steganography-using) actors. Now that the literature is finally equipped with at least one practical method for pooled steganalysis, we can look for practical methods for batch steganography. We briefly discuss theoretical approaches to this problem in subsection 1.1.

The work reported in this paper was motivated by two aims. First, to extend the experimental evidence base for the pooled steganalyzer of [15], which so far has only been tested against one steganographic algorithm (nsF5), by testing against additional steganographic embedding algorithms, focusing on those with existing implementations which make them usable by a non-expert. Second, to test different embedding strategies^1, again with a focus on embedding strategies which could be used by a non-expert. After briefly summarising the blind pooled steganalyzer (section 2), we present the design of our experiments in section 3 and the results in section 4.

The authors found the consistency of the results, which favour concentration of payload in as few covers as possible, surprising. We performed further investigations to understand why they arise, which highlighted the significance of an embedding distortion which, for individual images measured by a popular steganalytic feature set, is nonlinear with respect to payload. Furthermore, feature preprocessing such as normalization or whitening makes a significant difference to this effect. These results are reported in section 5. Finally, section 6 concludes the paper with a discussion of these insights, and their importance for the design of both embedding strategies and future pooled steganalysis.

^1 To avoid confusion, the terminology embedding algorithm is used for the steganographic method for embedding in individual objects, and embedding strategy for the batch problem of how payload is allocated amongst multiple covers.

1.1 Optimal Embedding Strategy

When the batch steganography problem was presented, some theoretical approaches were tried. In [11] the detector is modelled as a single real output for each image (e.g. a quantitative steganalyzer), combined into pooled steganalysis using a number of approaches. In a rather restricted framework it is shown that the optimal embedding strategy is one of two extremes: concentration of payload into the fewest possible covers, or spreading payload equally amongst all covers. A similar result is obtained in [12], for additional pooling methods, and another related result appears in [13]. All of these are limited in scope, and in particular it is assumed that the covers are homogeneous (equal capacity and sensitivity to embedding). This is not the case in the real world.

We can make use of more recent literature to determine an embedding strategy which is either optimal (if the definition of optimality is in terms of a distortion function) or at least near-optimal (if defined in terms of detectability), using adaptive embedding [4, 18]. Designed for embedding in a single image (or potentially a cover of any medium), adaptive embedding aims to concentrate the changes caused by embedding in those areas where they will be least detectable: in practice, in noisy and edge regions of images. Given a function which measures distortion, embedding which minimizes distortion must be located randomly following a Gibbs distribution, and a close approximation to this Gibbs embedding can be obtained using Trellis coding [2, 3, 5].

Given a set of cover images, in principle it should be possible to extend the distortion function to measure distortion in multiple images (perhaps by simple summation, perhaps by a more sophisticated definition) and apply Gibbs embedding to the entire set of images, performing at a stroke both batch allocation and embedding using a strategy which optimizes distortion. However, this is not straightforward: the definition of distortion should not necessarily be simple summation, particularly if the cover images are of different sizes or very heterogeneous; there is the underlying limitation that the optimality of the embedding depends on a close relationship between distortion and detectability; finally, there is as yet no available implementation of Gibbs embedding in multiple images.

For this paper, it is the final limitation which causes us to exclude Gibbs embedding. We have chosen to focus on real-world steganography, such as would be possible by a non-expert using tools available today, addressing the following question: how would we advise a non-expert to hide data in multiple objects? This restricts us to embedding software which works on single images at a time (because no batch embedding has yet been implemented), to embedding algorithms that have available implementations, and to a set of cover images obtained from a real-world source. We assume that the embedder can use existing tools to determine the maximum capacity of individual images (see subsection 3.1) and manually split the payload into segments of their chosen size^2. We can benchmark simple options against the only existing pooled steganalysis method, which is described in the following section.

^2 We will not consider how the sender informs the recipient of the lengths of each segment, or which image corresponds to which segment. It could be part of the shared secret key or a small header.

2. BLIND UNIVERSAL POOLED STEGANALYSIS

Suppose that multiple actors each transmit multiple objects, all of which have been intercepted by the steganalyst. The steganalyst is assumed to know which actor sent which object. We assume that each actor used a source of cover objects, different from the sources used by other actors. We do not require the steganalyst to have access to these sources, which makes it difficult to train traditional steganalysis methods. The steganalyst's aim is to identify a guilty actor or actors, who use steganography in some (not necessarily all) of their transmitted objects.

The first published method [14] for this scenario relied on hierarchical clustering. Its effectiveness was tested in laboratory conditions by simulating up to 13 actors by different digital cameras. Subsequent work [15] simulated realistic conditions by using a social network dataset with thousands of actors. The same work also proposed to replace hierarchical clustering by an outlier detection algorithm (local outlier factor [1]), as it argued that guilty actors represent outliers rather than tight clusters in the feature space. The latter method, used in this paper, works as follows.

First, we extract features from all available images. In principle, any steganalytic feature set might work, if sensitive to the steganographic algorithm used by guilty actors. (More precisely, it should be more sensitive to steganographic alteration than to innocent image processing operations.) In the experiments presented in this paper, we have used the PF-274 feature set [19], since it offers good detection accuracy against the tested steganographic algorithms; it also has relatively low dimension. After extraction, the features are normalized to zero mean and unit variance, and optionally whitened and renormalized.

Second, we group the extracted features by actor, and calculate distances between all pairs of actors using maximum mean discrepancy (MMD) [8]. In the experiments presented here, we use the linear kernel for MMD, which corresponds to the L2-distance between the mean feature vectors of individual actors. This choice is based on the experiments presented in [14]. The advantage of pooling objects from each actor is that the method improves the signal-to-noise ratio compared with working on individual objects. This leads to better accuracy in the identification of the guilty actor.
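To make the distance computation concrete: with a linear kernel, the MMD between two actors reduces to the Euclidean distance between their mean feature vectors. The following sketch is our illustration, not the authors' code, and assumes features have already been extracted as equal-length numeric vectors.

```python
# Linear-kernel MMD between actors: L2 distance of mean feature vectors.
# Illustrative sketch only; feature extraction is assumed done elsewhere.
from math import sqrt

def mean_vector(features):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def linear_mmd(actor_a, actor_b):
    """Linear-kernel MMD: Euclidean distance between the mean vectors."""
    ma, mb = mean_vector(actor_a), mean_vector(actor_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))

def pairwise_distances(actors):
    """Distance matrix over a dict {actor_id: [feature vectors]}."""
    ids = sorted(actors)
    return {(i, j): linear_mmd(actors[i], actors[j])
            for i in ids for j in ids}
```

Since only the mean vectors matter under the linear kernel, an implementation can keep one running mean per actor instead of storing the full feature cloud.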

Third, we rank actors based on their pairwise distances calculated in the previous step. To do so, we use the local outlier factor method (LOF), because it has favourable properties for our application: it requires only pairwise distances between points (actors); it is application agnostic; it provides a measure of how much of an outlier each point (actor) is; and it relies on a single hyper-parameter, to which the method is not particularly sensitive.

The LOF method provides a measure of outlierness for every actor, but does not provide a threshold above which an actor should be considered an outlier. We simply use the LOF values to rank the actors according to their guiltiness. The evaluation criterion measures how often the guilty actor appears among the top x% most suspicious actors.

Some details of the LOF method are briefly described inAppendix A.
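For concreteness, LOF can be computed from the pairwise distance matrix alone. The following compact implementation follows the standard definition of Breunig et al. [1]; it is an illustrative sketch rather than the implementation used in the experiments.

```python
# Local Outlier Factor from a symmetric n x n distance matrix.
# Sketch following the standard LOF definition; not the paper's code.

def lof_scores(dist, k):
    """Return one LOF score per point; higher means more of an outlier."""
    n = len(dist)
    kdist, nbrs = [], []
    for p in range(n):
        ds = sorted((dist[p][o], o) for o in range(n) if o != p)
        kd = ds[k - 1][0]                     # k-distance of p
        kdist.append(kd)
        # k-nearest neighbourhood (ties at the k-distance included)
        nbrs.append([o for d, o in ds if d <= kd])

    def lrd(p):
        """Local reachability density of p."""
        reach = [max(kdist[o], dist[p][o]) for o in nbrs[p]]
        return len(reach) / sum(reach)

    dens = [lrd(p) for p in range(n)]
    # LOF: how much sparser p's neighbourhood is than its neighbours'
    return [sum(dens[o] for o in nbrs[p]) / (len(nbrs[p]) * dens[p])
            for p in range(n)]
```

Scores near 1 indicate density comparable to the neighbours (an inlier); the ranking step simply sorts actors by this score.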

3. EXPERIMENTAL DESIGN

In this paper we evaluate the blind universal pooled steganalyzer in a variety of situations, testing different combinations of the following parameters: total number of actors, number of images per actor, embedding algorithm, embedding strategy, and total payload. For all the experiments the true number of guilty actors was fixed at one, for which the anomaly detector of [15] is best suited. The options for the other parameters are described in the following subsections.

3.1 Embedding Algorithms

Searching the internet for steganographic algorithms for JPEG images that can be readily used by ordinary users, we found just five: F5 [24, 25], JPHide&Seek [17], Steghide [9, 10], OutGuess [21, 22], and JSteg [23].

OutGuess [21] is an adaptation of LSB replacement for JPEG images. It modifies DCT coefficients, skipping coefficients equal to zero or one to avoid visible distortions due to changing zeros to ones. To evade detection based on first-order statistics, OutGuess saves half of the DCT coefficients for a statistical restoration phase, during which it tries to reconstruct the histogram of DCT coefficients of the original cover image. It is similar to JSteg [23], which we did not test because it is very weak and even more detectable than OutGuess [20].

F5 [24] tries to preserve the shape of the histogram of DCT coefficients. It encodes the message into the LSBs of DCT coefficients, but instead of replacing the LSBs, it decreases the absolute values of DCT coefficients whose LSBs do not match the message. The F5 algorithm does not use zeros for embedding, and if a DCT coefficient is changed to zero during embedding then the message bit is re-embedded using another coefficient. This increases the number of zeros in the stego image, an effect called shrinkage. F5 also uses matrix embedding, a coding scheme that increases the embedding efficiency (the number of bits embedded per embedding change), reducing embedding distortion for small payloads.

Steghide [10], like OutGuess, also tries to preserve first-order statistics, but without resorting to a statistical restoration phase. Steghide creates a graph-like structure with vertices representing (groups of) coefficients that need to be changed. An edge between two vertices represents modifications in both groups such that, if performed, both vertices code the message. The edges are also assigned weights based on the visual distortion caused by an embedding change. During embedding, the algorithm tries to find a vertex matching in the graph, minimizing the distortion while coding the message.

Although a C implementation of the JPHide&Seek algorithm is available, a higher-level description of its function is not known to the authors.

Besides the above four algorithms, we have included the nsF5 algorithm, for which only a simulator has been released [16]. Although this is not in accord with our goal of investigating the security of practically available algorithms, the algorithm was included due to its popularity in the research community. The nsF5 algorithm uses the same embedding operation as F5, but replaces matrix embedding based on Hamming codes with wet paper codes with improved efficiency [6], removing the shrinkage effect. Note that we used the 2008 version of the nsF5 simulator, which is not the same as the most recent version currently published by the author: the current version simulates a higher embedding efficiency.

3.2 Embedding Strategies

Recall the goal of batch steganography: the steganographer wants to spread a message of total length M among images (X_1, ..., X_n) with capacities (c_1, ..., c_n), using a steganographic embedding algorithm for individual images. (We will discuss how (c_1, ..., c_n) are determined in the following subsection.) We have identified five strategies to determine the message fragment lengths (m_1, ..., m_n), with M = ∑_{i=1}^{n} m_i, to be embedded into the images.

In the max-greedy strategy, the steganographer wants to embed the message into the fewest possible number of covers. During embedding, he iteratively chooses the cover with the highest capacity yet to be used, and embeds a portion of the message equal to the capacity of the image. Assuming that the images are ordered by capacity c_1 ≥ c_2 ≥ ... ≥ c_n, this leads to the following message lengths:

m_i = c_i,   for i ∈ {1, ..., I-1},
m_I = M - ∑_{i=1}^{I-1} m_i,
m_i = 0,   for i ∈ {I+1, ..., n},

where I denotes the smallest possible number of images with sufficient capacity, i.e.

I = min { i : M ≤ ∑_{j=1}^{i} c_j }.

The max-random strategy is the same as max-greedy, except that the covers used for embedding are chosen in a random order. Consequently, the number of utilized images can be higher than in the max-greedy strategy. This simulates a steganographer who is using a strategy of concentrating payload, but is not taking individual capacity into account in advance.

In the linear strategy, the message is distributed into all available covers proportionately to their capacity. This means that

m_i = c_i M / ∑_{j=1}^{n} c_j.

(Fractional bits are ignored in this study.)

In the even strategy, the message is distributed evenly into all available covers regardless of their capacity. Thus

m_i = M / n.

Executing this strategy, it can happen that for some images the message length m_i exceeds the capacity c_i. In these cases we set m_i = c_i and recalculate an even message length for the remaining images.

In the sqroot strategy, the message is again spread among all images, with the length of the fragments being proportional to the square root of their capacities, i.e.

m_i = √c_i · M / ∑_{j=1}^{n} √c_j.

In our preliminary experiments we found that this so frequently exceeded the capacities that we discarded the strategy. (It is, in any case, based on a misreading of the square root law of capacity.)

Table 1: Average capacity of images in the social network image database, for the steganographic algorithms used in this paper. Capacities (bits) were estimated by the algorithm described in Section 3.3. The relative capacity (bpnc) is defined as the estimated capacity divided by the number of non-zero DCT coefficients in the image. The number of zero-capacity images is out of 800 000.

               Mean estimated   Mean relative   Number of zero-
               capacity         capacity        capacity covers
  nsF5         53192            0.80            1265
  F5           38427            0.57            0
  JPHide&Seek  22133            0.33            0
  Steghide     25249            0.38            3161
  OutGuess     25654            0.38            121
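The four strategies retained for the experiments can be sketched in a few lines. The following is a hypothetical helper of ours (capacities and message length in bits), with fractional-bit and remainder handling simplified as in the text.

```python
# Sketch of the batch embedding strategies of Section 3.2 (illustrative,
# not the authors' code). caps[i] is the capacity of cover i in bits.
import random

def allocate(M, caps, strategy, rng=random):
    """Return fragment lengths m[i] with sum(m) <= M and m[i] <= caps[i]."""
    n = len(caps)
    assert M <= sum(caps), "message exceeds total capacity"
    m = [0] * n
    if strategy in ("max-greedy", "max-random"):
        order = sorted(range(n), key=lambda i: -caps[i]) \
            if strategy == "max-greedy" else rng.sample(range(n), n)
        left = M
        for i in order:                     # fill covers completely until done
            m[i] = min(caps[i], left)
            left -= m[i]
            if left == 0:
                break
    elif strategy == "linear":              # proportional to capacity
        total = sum(caps)
        m = [c * M // total for c in caps]  # fractional bits ignored
    elif strategy == "even":                # equal share, capped and recomputed
        todo, left = list(range(n)), M
        while todo:
            share = left // len(todo)
            over = [i for i in todo if caps[i] <= share]
            if not over:
                for i in todo:
                    m[i] = share
                break
            for i in over:                  # capacity exceeded: cap and redo
                m[i] = caps[i]
                left -= caps[i]
                todo.remove(i)
    return m
```

For example, with capacities (10, 5, 8, 2) and M = 12, max-greedy fills only the two largest covers, while even gives every cover its capped equal share.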

3.3 Estimation of Capacities

The embedding strategies described above require the steganographer to know the maximum length (capacity c_i) he can embed into a particular image with the chosen embedding algorithm. In fact such a maximum is not always well-defined, since capacity can depend on content. Following our aim of simulating real-world practical steganography, we estimate the maximum message length for each embedding algorithm, and each cover image, as follows.

First, we query the implementation of the algorithm to provide an initial estimate of the maximum message length. This is done either by embedding a very short message into the given image (the implementations of F5, JPHide&Seek, and OutGuess print the estimated capacity to the console upon embedding), or by asking for information about a given image (the implementation of Steghide). The implementation of nsF5 does not provide a capacity estimate, so we set it to 0.8 · nc, where nc stands for the number of non-zero DCT coefficients.^3

Once we have an initial estimate of capacity, we try to embed a randomly generated message of this length. If the embedding fails, the estimate of the capacity is decreased by 10 bytes and the procedure is repeated. Otherwise, we deem the current estimate of the capacity final. We allow a maximum of 100 repetitions, after which the image is deemed not suitable for embedding and its capacity is set to zero.

The purpose of the last step is threefold: (a) it refines the initial estimates, (b) it verifies that the message can actually be embedded, and (c) it discards singular images (such as night pictures, blue skies, etc.).
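The refinement loop can be sketched as follows; `try_embed` is a hypothetical callback wrapping the real embedding tool, and the step of 10 bytes (80 bits) and limit of 100 repetitions follow the text.

```python
# Sketch of the capacity-refinement loop of Section 3.3 (illustrative).
# `try_embed(length)` is a hypothetical callback: it returns True when a
# random message of `length` bits can actually be embedded in the image.
def estimate_capacity(initial_estimate, try_embed,
                      step=80, max_tries=100):
    """Refine an embedder-reported capacity; 0 marks an unusable image."""
    est = initial_estimate
    for _ in range(max_tries):
        if est > 0 and try_embed(est):
            return est                # verified: the message embeds
        est -= step                   # decrease by 10 bytes (80 bits)
    return 0                          # image deemed unsuitable
```

In the experiments the returned value is then further reduced to 90% to stay safely below the true maximum, as described in the next paragraph.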

^3 This capacity estimate of nsF5 is based on discussions with the author and our experience from previous work; it does not seem to have been published.

In practice, the actual message length may also depend on the content of the message itself, and on the steganographic key, because of slightly varying correlation between the cover and the message. To circumvent unpredictable behaviour, we decreased all capacity estimates to 90% of the original. Our capacity estimates are thus strictly conservative, but this should affect the results only slightly, as there should be little difference between embedding at total capacity and near-total capacity. The mean capacities, relative capacities, and number of zero-capacity images (out of 800 000 in our image set, see below) are displayed in Table 1.

3.4 Real-World Images

Our experiments were performed on a highly realistic data set, obtained from a leading social networking site. Since the process of creating this data set has already been described in [15], we recapitulate it only briefly. We took all images uploaded to a popular social network and made publicly visible by users who identified themselves as members of Oxford University. After downloading by following public links, the files were anonymized and no personally-identifiable information was retained. All images have the same format (JPEG), with the same quality factor (85), and approximately the same size (1 Mpix). Nonetheless they are very hard to steganalyze, because of an unknown and probably wild processing history: the social network automatically resizes and re-compresses uploaded images, following potential image processing operations between the camera and upload.

The different users will be using different cameras (potentially more than one each), which makes the set heterogeneous. Some images are not even natural photographs: they include montages, images with captions, and entirely synthetic advertisements. In a realistic scenario, we must deal with this type of difficult data. Pooled steganalysis can help to amplify a stego signal, but we must expect a certain variation between innocent actors.

For the experiments, we used a subset of this database, selecting exactly 200 photos from each of 4000 users (“actors”), who after anonymization are known only by an integer 1–4000. These 800 000 images form the data set used in this paper. It is a highly realistic image set because it is exactly the sort of media used on the internet in large social networks.

3.5 Experimental Protocol

By using the social network data set, we simulate the scenario of monitoring a network and identifying the guilty users. We vary the number of actors, n_a ∈ {100, 400, 1600}, and the number of images per actor, n_i ∈ {10, 20, 50, 100}. The actors and images were always selected randomly. All experiments discussed in Section 4 follow the protocol described here.

First, we must determine the size of the entire message M. Although its maximum depends on the embedding algorithm, in order to make like-for-like comparisons between different algorithms we need a fixed reference point. Since the information in a JPEG image is the nonzero coefficients, we measure payload relative to the total number of nonzero DCT coefficients. These are commonly referred to by the acronym “nc”, hence we will measure total payload as the number of bits per nonzero coefficient, bpnc. Bearing in mind that algorithms such as OutGuess and Steghide cannot reliably embed more than about 0.25 bpnc in a single image, we will choose payloads only up to this level. This also ensures that the message length does not exceed the total capacity, M ≤ ∑_{i=1}^{n} c_i.

Then we can perform what we call “one experiment”:

1. Randomly select n_a actors and n_i images per actor.

2. Randomly choose a guilty actor and embed a random payload measuring p bpnc into his images using the chosen embedding strategy and embedding algorithm. The message length embedded into the images is equal to pN, where N is the total number of non-zero DCT coefficients in the actor's images.

3. Extract features from images of all actors.

4. Normalize each feature to zero mean and unit variance.

5. Whiten the set of features to decorrelate them, discarding trivial components corresponding to eigenvalues of less than 0.01. This is achieved using the Principal Component Transform (PCT), after which the features are renormalized to unit variance. (The whitening step may be omitted.)

6. Group extracted features by actor.

7. Calculate distances between actors (linear MMD, i.e. the distance between the mean feature vectors of each pair of actors).

8. Calculate the LOF value of every actor. The number of nearest neighbours, k, was set to 10, as in the publication [15].
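Steps 4 and 5 above can be sketched as follows; a small pure-Python power-iteration PCA stands in for a linear-algebra library, and the 0.01 eigenvalue threshold follows the text. This is our illustration, not the experimental code.

```python
# Sketch of protocol steps 4-5: per-feature normalization, then a
# whitening PCT that discards components with eigenvalue < 0.01.
import random

def normalize(X):
    """Zero mean, unit variance per feature column (population variance)."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    sd = [(sum((row[j] - mu[j]) ** 2 for row in X) / n) ** 0.5 or 1.0
          for j in range(d)]
    return [[(row[j] - mu[j]) / sd[j] for j in range(d)] for row in X]

def whiten(X, eps=0.01, iters=500):
    """Project onto principal components with eigenvalue >= eps,
    renormalized to unit variance (power iteration with deflation)."""
    X = normalize(X)
    n, d = len(X), len(X[0])
    C = [[sum(r[i] * r[j] for r in X) / n for j in range(d)]
         for i in range(d)]                        # covariance matrix
    comps = []
    for _ in range(d):
        v = [random.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):                     # power iteration
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d))
                  for i in range(d))               # Rayleigh quotient
        if lam < eps:                              # discard trivial components
            break
        comps.append((lam, v))
        for i in range(d):                         # deflate
            for j in range(d):
                C[i][j] -= lam * v[i] * v[j]
    # project each row and rescale every component to unit variance
    return [[sum(r[j] * v[j] for j in range(d)) / lam ** 0.5
             for lam, v in comps] for r in X]
```

A real pipeline would use a library eigendecomposition, but the effect is the same: after whitening, the retained components are uncorrelated with unit variance, which the MMD step then treats isotropically.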

For every combination of parameters, we repeated the experiment 500 times (except 250 times for F5, which has a slow Java implementation). We must decide on a metric for the success of the steganalyzer: we take the proportion of experiments in which the guilty actor was ranked in the top 5% most suspicious.

There are many combinations of parameters making up one experiment, each requiring the examination of hundreds of thousands of images. Performing all the experiments took approximately 230 core days, distributed over a small cluster.

4. RESULTS

Figure 1 displays the experimental results for n_i = 100 images per actor. The array of charts varies the number of actors (100, 400, 1600, horizontally) and embedding algorithm (vertically). Within each chart, the x-axis gives the payload in bpnc, and the y-axis the rate of success for the steganalyzer, i.e. how often the guilty actor was ranked in the top 5% by LOF. The different embedding strategies are denoted by different point types.

There are some ways in which these charts show variation: larger payloads are unsurprisingly more detectable, the different embedding algorithms are of different security (of which more shortly), and it is somewhat easier to find a guilty actor in the top 5% of 1600 actors than in the top 5% of 100 (a result replicating some in [15]). But most striking is the consistency between all these experiments: for any given payload size, where detection is not around random or perfect, the even embedding strategy is most detectable, the linear strategy next most detectable, and the max strategies least detectable. Max-greedy is less detectable than max-random. Similar consistency is seen with n_i = 50, 20, and even n_i = 10 (although detection rates are much lower when there is so much less evidence available), and if the performance metric is changed to count the guilty actor in the top n most suspicious rather than the top x%. We will not bore the reader with displays of such charts.

This represents an important lesson for a batch steganographer: it is substantially more secure to allocate payload into the smallest total number of covers, picking the largest-capacity covers and filling them completely. Depending on the detector and desired security, up to approximately twice as large a payload can be embedded using the max strategies as the linear or even strategies. Another way of looking at this data is to conclude that the blind universal steganalyzer has a weakness, which can be exploited by an embedder choosing the max-greedy strategy.
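The max-greedy allocation described above (fill the largest-capacity covers completely, in decreasing order of capacity) can be sketched as follows; the function name and units are our own:

```python
def max_greedy(capacities, total_payload):
    """Allocate total_payload (same units as capacities, e.g. bits)
    by completely filling covers in decreasing order of capacity."""
    alloc = [0] * len(capacities)
    remaining = total_payload
    for i in sorted(range(len(capacities)), key=lambda j: -capacities[j]):
        if remaining <= 0:
            break
        alloc[i] = min(capacities[i], remaining)   # fill this cover as far as possible
        remaining -= alloc[i]
    return alloc
```

For example, max_greedy([10, 50, 30], 60) fills the 50-unit and 30-unit covers, returning [0, 50, 10].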

One other item we highlight from Figure 1 is the curious behaviour of JPHide&Seek: there is above-random detection even for the smallest payloads when the even or linear strategies are used for embedding. It seems likely that JPHide&Seek either embeds a minimal message length or includes some sort of header, making it even more desirable to use as few images as possible for stego payload.

In Figure 2 we compare the security of the different embedding algorithms, assuming that the best strategy (max-greedy) is used. This confirms results seen in steganalysis of individual images, for example in [7], where nsF5 is found to be the most secure choice, followed by F5 and JPHide&Seek, with Steghide substantially less secure and OutGuess the worst.

5. FURTHER INVESTIGATIONS

We want to understand why the results of section 4 occur. We focus on three particular questions:

(1) Why do the strategies which concentrate payload (max-greedy and max-random) evade detection substantially better than those which spread payload (linear and even)?

(2) Why do the strategies which use image capacity in the allocation (max-greedy and linear) evade detection better than those which do not (max-random and even)?

(3) Do the results represent a weakness of the blind universal pooled steganalyzer, which can be fixed?

First, we can examine the problem by recalling that the steganalyzer depends only on the L2-distance between the centroids of each actor's feature cloud (the PF-274 features extracted from the entire set of images they transmit). What matters to detectability, apart from the inner workings of the LOF calculation, is how the guilty actor's centroid is affected by embedding.

Let us fix a guilty actor, and write v_1, ..., v_n for the feature vectors extracted from the cover images they use. Correspondingly, let us write v'_1, ..., v'_n for the feature vectors extracted from their transmitted objects. We continue to write m_1, ..., m_n for the payload allocated to each image by the guilty actor's strategy. Thus if m_i = 0 then v'_i = v_i, otherwise very likely v'_i ≠ v_i. Write v̄ and v̄' for the centroids of the guilty actor's cover and stego images respectively.


Figure 1: Accuracy of the blind pooled steganalyzer, for five different embedding algorithms (charts vertically: nsF5, F5, JPHide&Seek, Steghide, OutGuess), number of actors (charts horizontally: 100, 400, 1600), total payloads in bpnc (x-axis of each chart), and embedding strategy: max-greedy (•), max-random (◦), linear, and even (each strategy denoted by a different point type). The y-axis is the proportion of experiments in which the true guilty actor was ranked in the top 5% most suspicious actors. In all these charts, the number of images per actor n_i = 100.


Figure 2: Comparison of the different embedding algorithms using the max-greedy strategy, for 100, 400, and 1600 actors (charts horizontally), following the same colours and line styles as Figure 1.

Despite copious literature making use of these (and other) features for classification of payload, we know little about how payload size affects features. We know that it does affect features, because steganography is detectable, and from [20] we know that it is possible to construct an estimator of the (relative) payload length from the stego features. In [20] it is shown that even an ordinary least-squares estimator has quite good performance, which suggests a broadly linear relationship between features and payload size.

So suppose that, abusing mathematical notation to reason roughly about average distortion, the effect of payload on features is approximately linear, i.e.

v'_i ≈ v_i + m_i δv,

where δv is a direction in which stego objects tend to move. Then

v̄' ≈ v̄ + δv Σ_i m_i.

This of course neglects many details, including whether stego objects move in different directions or at different rates depending on the cover or message (though this should all wash out on average). However, it at least leads us to expect that all embedding strategies should be equally detectable, as they cause equal distortion to the guilty actor's centroid when the total payload is fixed.
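Under this linear approximation, the centroid shift depends only on Σ m_i, so spreading the same total payload over 100 covers or concentrating it in 10 gives identical distortion. A quick numerical check of this consequence, using synthetic feature vectors (not real image features):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 100, 274                       # images per actor, feature dimension
covers = rng.normal(size=(n, dim))      # synthetic cover features v_1..v_n
delta = rng.normal(size=dim)            # common stego direction, delta-v

def centroid_shift(m):
    """||v-bar' - v-bar|| under the linear model v'_i = v_i + m_i * delta."""
    stego = covers + np.outer(m, delta)
    return np.linalg.norm(stego.mean(axis=0) - covers.mean(axis=0))

even = np.full(n, 0.1)                  # total payload 10, spread evenly
conc = np.zeros(n); conc[:10] = 1.0     # same total, concentrated in 10 covers
assert np.isclose(centroid_shift(even), centroid_shift(conc))
```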

Since this is manifestly not the case, our first investigation into the cause of (1) and (2) is to test the assumption of a linear relationship between payload and distortion in individual images. We picked an embedding algorithm and a random cover image, and computed three feature vectors:

· v_0, the cover features;

· v_0.1, features from the image with a fixed payload of 10% of its capacity;

· v_p, features from the image with a random-length payload, proportion p of its capacity.

On the left of Figure 3 we display scatterplots of

‖v_p − v_0‖ / ‖v_0.1 − v_0‖    (1)

against p. If the relationship between feature change and (relative) payload is linear then we would expect to see a straight line. The use of a fixed payload in the denominator may create an artificial "pinch" near 0.1, but means that the results can be displayed with a comparable y-axis.

We observe a roughly linear fit, and no sign of a significantly nonlinear one, although the extent of the relationship depends on the embedding algorithm. (This shows that the least-squares quantitative estimator is doing some work, weighting towards relevant features and away from noisy ones, in order to achieve the good performance displayed in [20].)

The authors found this puzzling, because it does not fit with the observed behaviour of the different batch embedding strategies. But then we remembered that the LOF detector works on features which, in order to equalize their weight and remove correlation, have been normalized, whitened, and then renormalized (we call this "normwhite space"). So we performed an identical experiment, but computing the norms of (1) on features first subjected to the same operations. These are displayed in the middle of Figure 3.

This time the results are markedly different. There is a strong nonlinearity, with larger payloads causing proportionately less distortion than smaller ones. Observing the values on the y-axes, we see that payloads of near-full capacity cause distortion only about 3–4 times as large as payloads at 10% of capacity. This phenomenon directly explains why concentration of payload in few covers is the best type of batch steganography against our universal pooled detector, since the total distortion "cost" is much lower thanks to the sublinear relationship.

How is it possible for the left and middle columns of Figure 3 to be so different, and how is a nonlinear relationship compatible with the workings of linear least-squares regression in [20]? To answer the first, consider that normalization is geometrically a shear transform and (unlike the PCT, which is orthonormal) can magnify or reduce the angles between vectors; vectors which are nearly parallel (such as stego distortions in these images, a fact which we verified but do not include here for reasons of space) can become much less parallel when subjected to shear, and that is exactly what happens to stego features. To answer the second, we must simply expect that the regression manages to find a linear transform which does the opposite, making stego distortion nearly parallel and thus reducing the nonlinear effect of payload.

We performed a number of investigations to try to explain the nonlinear relationship between (whitened, normalized) features and payload. They revealed that, perhaps


Figure 3: Further experiments to understand the optimality of concentrated payload. Five different embedding algorithms, vertically (nsF5, F5, JPHide&Seek, Steghide, OutGuess). Left: relative payload versus feature distortion (relative to distortion at 0.1 bpnc) in individual images, for raw features. Middle: the same, but distortion calculated for normalized, whitened features (normwhite space). Right: capacity (Kb) versus sensitivity to payload (normwhite space), in individual images. Each plot is from 10 000 images taken at random from the entire database of 800 000.


unsurprisingly, not all features contain useful information about payload. After whitening we should talk of "components" rather than "features", since each component is a linear combination of features. For components with much information, for example those with the largest eigenvalues from the PCT, the addition of payload causes apparently-linear movement in one direction or the other. But particularly for components with small eigenvalues from the PCT, the addition of payload simply causes noise, with the value moving in no particular direction. Now consider that

‖v_p − v_0‖ = √((c^1_p − c^1_0)² + · · · + (c^m_p − c^m_0)²)    (2)

where c^i_p is the i-th component of features from an image with payload p. If some of these differences c^i_p − c^i_0 consist only of noise, (2) boils down to √(cp² + d), where c and d have positive expectation. This explains the shapes seen in Figure 3: essentially they stem from the inclusion of some noisy features/components which are not informative about steganography. However, it seems impossible for a blind detector to avoid this, since it cannot remove non-informative features without some stego information.

Although the nonlinear phenomenon explains why the max strategies are more secure than linear or even, we also want to answer question (2), above. We can explain this quite simply, by looking at the rate at which ‖v_p − v_0‖ grows with p. In the third column of Figure 3, we plotted the capacity of each image against ‖v_p − v_0‖/p; the latter quantity we call sensitivity. Clearly, images with larger capacity have less sensitivity (the statistical significance of the relationship is extremely strong for each embedding algorithm). Thus the max-greedy strategy has an additional advantage, beyond concentration in as few covers as possible: it picks the least sensitive covers.⁴

Finally, we turn to question (3): can the detector be improved in light of our new understanding? The authors think not: there is no way to restore a linear payload relationship by avoiding whitening/normalization, because this ruins the scales of the features and makes the LOF an incorrect measure of outliers. Although it makes the performance of the different embedding strategies more equal, it does so by lowering detectability of all of them (charts not included for reasons of space).

6. CONCLUSIONS

We have tested the LOF-based anomaly detector from [15] more widely, showing that it works against different embedding algorithms and strategies, and that it is quite sensitive to payloads of around 0.1 bpnc or lower; hence real-world steganographic algorithms (which excludes nsF5) are rather insecure. The consistent lesson is that a greedy embedding strategy, which concentrates payload in as few covers of the largest possible capacity, is able to exploit a property of the detector. The property is due to a nonlinear relationship between (unavoidably normalized) features and payload size, which is an important insight for both embedders and detectors.

We should verify that this is not merely a result for the PF-274 feature set we have used in this work, and it would be helpful to develop a hypothesis test for the phenomenon. This is a matter for future work.

⁴It would not be in keeping with the philosophy of this paper to allow an embedder directly to measure the sensitivity of each image in their embedding strategy, since it would require them to know the feature set used by their opponent.

We have also exposed a gap in the research on steganalysis features: their geometric structure, and the effects of steganographic payload, are little understood beyond the mere existence of quantitative estimators. Further work is needed to determine whether the near-linearity of stego distortion can be increased and exploited.

We must stress that much of the quasi-analysis of section 5 arises because of the use of linear MMD in the LOF distances. However, the only examination of an alternative (a Gaussian kernel, [14]) showed weaker detection performance. In theory, a nonlinear detector can have power proportional to the square of the payload size, at least locally (essentially the reason for the square root law of steganographic capacity), so further research into alternative MMD kernels is important. Another line of research would be to replace LOF with a method more robust to noisy features.

Finally, we might conjecture that all blind steganalysis will have a similar weakness: features with pure noise cannot be removed without knowledge of some stego data, leaving a distortion which must always be nonlinear. It would be interesting to formalize this and perhaps prove the general optimality of greedy embedding. From the detector's side, we suggest that blind detection could be augmented by non-blind feature selection, in which the feature set is tuned in the direction of known stego algorithms.

7. ACKNOWLEDGMENTS

The work on this paper was supported by the European Office of Aerospace Research and Development under research grant number FA8655-11-3035. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of EOARD or the U.S. Government.

The work of T. Pevny was also supported by the Grant Agency of the Czech Republic under project P103/12/P514.

8. REFERENCES

[1] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM, 2000.

[2] T. Filler and J. Fridrich. Gibbs construction in steganography. IEEE Transactions on Information Forensics and Security, 5(4):705–720, 2010.

[3] T. Filler and J. Fridrich. Minimizing additive distortion functions with non-binary embedding operation in steganography. In Proc. 2010 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6, 2010.

[4] T. Filler and J. Fridrich. Design of adaptive steganographic schemes for digital images. In N. D. Memon, J. Dittmann, A. M. Alattar, and E. J. Delp III, editors, Media Watermarking, Security, and Forensics XIV, volume 7880 of Proc. SPIE, page 78800F. SPIE, 2011.


[5] T. Filler, J. Judas, and J. Fridrich. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security, 6(3):920–935, 2011.

[6] J. Fridrich, M. Goljan, and D. Soukal. Wet paper codes with improved embedding efficiency. IEEE Transactions on Information Forensics and Security, 1(1):102–110, 2006.

[7] J. Fridrich, T. Pevny, and J. Kodovsky. Statistically undetectable JPEG steganography: Dead ends, challenges, and opportunities. In Proc. 9th ACM Workshop on Multimedia and Security, MM&Sec, pages 3–14. ACM, 2007.

[8] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, pages 513–520, 2007.

[9] S. Hetzl. Implementation of the Steghide algorithm, ver. 0.5.1 (released October 2003). http://steghide.sourceforge.net/, last accessed April 2012.

[10] S. Hetzl and P. Mutzel. A graph-theoretic approach to steganography. In Proc. 9th International Conference on Communications and Multimedia Security, CMS, pages 119–128. Springer, 2005.

[11] A. D. Ker. Batch steganography and pooled steganalysis. In J. Camenisch, C. Collberg, N. Johnson, and P. Sallee, editors, Proc. 8th Information Hiding Workshop, volume 4437 of LNCS, pages 265–281. Springer, 2006.

[12] A. D. Ker. Batch steganography and the threshold game. In E. Delp III and P. Wong, editors, Security, Steganography, and Watermarking of Multimedia Contents IX, volume 6505 of Proc. SPIE, pages 0401–0413. SPIE, 2007.

[13] A. D. Ker. Perturbation hiding and the batch steganography problem. In K. Solanki, K. Sullivan, and U. Madhow, editors, Proc. 10th Information Hiding Workshop, volume 5284 of LNCS, pages 45–59. Springer, 2008.

[14] A. D. Ker and T. Pevny. A new paradigm for steganalysis via clustering. In N. Memon, J. Dittmann, A. Alattar, and E. Delp III, editors, Media Watermarking, Security, and Forensics XIII, volume 7880 of Proc. SPIE, pages 0U01–0U13. SPIE, 2011.

[15] A. D. Ker and T. Pevny. Identifying a steganographer in realistic and heterogeneous data sets. In N. Memon, A. Alattar, and E. Delp III, editors, Media Watermarking, Security, and Forensics XIV, volume 8303 of Proc. SPIE, pages 0N01–0N13. SPIE, 2012.

[16] J. Kodovsky. Simulator of the nsF5 algorithm with wet paper codes (released 2008). http://dde.binghamton.edu/download/nsf5simulator/, last accessed April 2012.

[17] A. Latham. Implementation of the JPHide and JPSeek algorithms, ver. 0.3 (released August 1999). http://linux01.gwdg.de/~alatham/stego.html, last accessed April 2012.

[18] T. Pevny, T. Filler, and P. Bas. Using high-dimensional image models to perform highly undetectable steganography. In P. W. L. Fong, R. Bohme, and R. Safavi-Naini, editors, Proc. 12th Information Hiding Workshop, volume 6387 of LNCS, pages 161–177. Springer, 2010.

[19] T. Pevny and J. Fridrich. Merging Markov and DCT features for multi-class JPEG steganalysis. In E. J. Delp III and P. W. Wong, editors, Media Watermarking, Security, and Forensics IX, volume 6505 of Proc. SPIE, pages 03–14. SPIE, 2007.

[20] T. Pevny, J. Fridrich, and A. D. Ker. From blind to quantitative steganalysis. IEEE Transactions on Information Forensics and Security, 7(2):445–454, 2012.

[21] N. Provos. Defending against statistical steganalysis. In Proc. 10th USENIX Security Symposium, SSYM, pages 323–335. USENIX Association, 2001.

[22] N. Provos. Implementation of the OutGuess algorithm, ver. 2.0 (released October 2001). http://www.outguess.org/, last accessed April 2012.

[23] D. Upham. Implementation of the JSteg steganographic algorithm. http://zooid.org/~paul/crypto/jsteg/, last accessed April 2012.

[24] A. Westfeld. F5: A steganographic algorithm. In I. Moskowitz, editor, Proc. 4th Information Hiding Workshop, volume 2137 of LNCS, pages 289–302. Springer, 2001.

[25] A. Westfeld. Implementation of the F5 steganographic algorithm (released May 2011). http://code.google.com/p/f5-steganography/, last accessed April 2012.

APPENDIX

A. LOCAL OUTLIER FACTOR

Suppose that we are given a set P of points, with a metric d : P × P → R and an integer parameter 1 < k < |P|. For this exposition we assume no exact duplicates in P and no exactly tied distances between members of P, which simplifies the description. For full details, see the original publication [1].

The reachability distance of point p from q, r_k(p, q), is the greater of d(p, q) and d(q, q'), where q' is q's k-nearest neighbour. Compared to the metric d, the reachability distance reduces statistical fluctuations for close objects, with smoothing controlled by the parameter k.

Fix a point p, and write P_k for the k-nearest neighbourhood of p. The local reachability density of p is defined as the inverse of the average reachability distance of point p from all points q ∈ P_k,

lrd_k(p) = k (Σ_{q ∈ P_k} r_k(p, q))^(−1),

and the local outlier factor (LOF) of p is

lof_k(p) = (1/k) Σ_{q ∈ P_k} lrd_k(q) / lrd_k(p).

Thus lof_k(p) captures the degree to which p is further from its k-nearest neighbours than they are from theirs. Defining it as a relative number makes the method adaptive in the sense that (a) it does not depend on the absolute values of the distances d(p, q), and (b) outliers can be detected in dense as well as in sparse regions of P.
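The definitions above translate directly into code. A compact NumPy sketch (assuming a Euclidean metric, no ties, and k < |P|, as in the exposition; not the implementation used in the experiments):

```python
import numpy as np

def lof(points, k):
    """Local outlier factor of every point, following the definitions above.
    points: (n, dim) array; returns an array of n LOF scores."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    order = np.argsort(d, axis=1)            # column 0 is the point itself
    knn = order[:, 1:k + 1]                  # k-nearest neighbourhood P_k
    kdist = d[np.arange(n), knn[:, -1]]      # each point's distance to its k-NN
    # local reachability density: lrd_k(p) = k / sum_{q in P_k} r_k(p, q),
    # where r_k(p, q) = max(d(p, q), k-distance of q)
    lrd = np.array([k / np.maximum(d[p, knn[p]], kdist[knn[p]]).sum()
                    for p in range(n)])
    # lof_k(p) = mean over q in P_k of lrd_k(q) / lrd_k(p)
    return np.array([lrd[knn[p]].mean() / lrd[p] for p in range(n)])
```

On a tight cluster plus one distant point, the distant point scores far above 1 while cluster members score close to 1, reflecting property (b).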