
Modern Trends in Steganography and Steganalysis

Jessica Fridrich

State University of New York

Department of Electrical and Computer Engineering

Fundamental questions

• How to hide information securely?
• What does it mean to embed securely?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?

[Chart: number of IEEE publications per year on steganography and steganalysis, broken down into journal and conference papers.]

Number of software stego applications

Data courtesy of Chet Hosmer, Wetstone Tech

Data hiding applications by media

Number of software applications that can hide data in electronic media as of June 2008. Data courtesy of Neil Johnson

Fundamental questions

• How to hide information securely?
• What does it mean to embed securely?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?

Answers to these questions depend on whom you ask!

Everything depends on the source

Artificial sources (iid sequences, Markov chains)
• Steganographic capacity known (positive rate)
• Capacity-approaching embedding using QIM
• Optimal detectors are LRTs

Empirical sources (digital media)
• Model unknown
• Capacity unknown
• Best embedding algorithms unknown
• Optimal detectors unknown

(artificial sources are the abstraction; empirical sources are the reality)

Steganographic channel

Cover source: random variable x on X
Stego source: random variable y on X
Message source: random variable m on M
Key source: random variable k on K
Channel: error-free/noisy (passive/active Warden)

[Diagram: the cover x with distribution px and the stego object y with distribution py are inspected by the Warden.]

Measure of security: the Kullback–Leibler divergence DKL(px ‖ py) (Cachin 1998)
Perfect security: DKL(px ‖ py) = 0
ε-security: DKL(px ‖ py) ≤ ε

• The KL divergence is the error exponent of the optimal Neyman–Pearson detector: the probability of missed detection at a fixed PFA decays as exp(−n DKL), where n is the number of pixels
• Other Ali–Silvey distances could be used

Note that x can be
- pixels (DCT coefficients)
- groups of pixels
- features = low-dimensional image representations
- entire images
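As a toy numerical illustration of the security measure, the sketch below computes DKL between two hypothetical cover and stego histograms (all numbers are invented for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats between two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical cover/stego histograms over four pixel-value bins.
p_cover = [0.40, 0.30, 0.20, 0.10]
p_stego = [0.38, 0.31, 0.21, 0.10]

dkl = kl_divergence(p_cover, p_stego)
# Perfect security would give exactly 0; a small positive value means weak detectability.
print(round(dkl, 6))
```

In practice the histograms would be replaced by distributions of features extracted from many images, but the principle is the same.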

Steganographic security

• Given full knowledge of the cover source px , perfectly secure stegosystems exist (Anderson & Petitcolas, 1997)

• Can be realized using the principle of cover synthesis by utilizing a perfect compressor of the cover source

• The sender communicates on average H(x) bits per object sent, the entropy of the cover source

Message source → encrypted message m (a stream of random bits) → source decompressor D → stego object y = D(m)

The stego object y is synthesized directly; there is no cover x on the sender’s input.

Perfect security

• First practical realization for JPEG images, Model-based steganography (Sallee, IWDW 2003)

Steganographic capacity is the maximum rate over all perfectly secure stegosystems

• Measured in bits per cover element (rate)

• Property of the cover+key+message sources and the channel, not of a specific embedding scheme!

• It is positive – the size of secure payload increases linearly with the cover size.

Steganographic capacity

The most general result (Moulin & Wang, 2004, 2008), for an active warden and a distortion-limited embedder:
Distortion per cover element: Em,k,x[d(x,y)] ≤ Ds
Channel distortion (y → y′): Ex[d(y,y′)] ≤ Dw

Steganographic constraint: px = py
Sender and Warden know the channel: px, pm, pk, py′|y
The capacity C(Ds, Dw) can be derived when the size of covers n → ∞. Explicit formulas are available for some simple cover sources:
- Bernoulli sequence with d(.,.) = Hamming distance (Moulin & Wang, 2004)
- Gaussian source with d(.,.) = squared distance (Gonzales & Comesaña, 2007)

Steganographic capacity

• For empirical sources, px is unknown. Alice cannot preserve it, and the Warden cannot build an LRT.
• If Alice preserves a simplified model, the Warden will work outside of the model and detect embedding.
• History teaches us that no matter how hard Alice tries, the Warden will be smarter and will find a representation of the covers where detection is possible (DKL > 0).
• Alice knows that her scheme is not secure. How big a payload (the secure payload) can Alice send with a fixed* risk?
*Fixed risk = fixed KL divergence.

Capacity for empirical sources?

Square root law of imperfect steganography (Ker 2007 for iid sources; Filler 2009 for Markov sources)

• The secure payload is a property of the embedding scheme! This is why we do not use the term “capacity”

• Secure payload grows only with the square root of the cover size

• Model mismatch does not merely decrease the communication rate: it makes the rate drop to zero! (already suspected by Anderson & Petitcolas 1997). This is fundamentally very different from other situations in communication

• Experimentally verified on images for a wide spectrum of steganographic methods

The square root law of secure payload

1. The cover source is a first-order Markov chain with transition probability matrix A = (aij), aij = Pr(xn+1 = j | xn = i)

- simplest tractable model that captures dependencies among pixels

2. Embedding is a probabilistic mapping applied to each pixel independently: Pr(y = k | x = l) = bkl

- captures the essence of most practical stegosystems

3. The steganographic system is not perfect (DKL > 0)
- we already know that perfectly secure stegosystems have a positive rate (the secure payload increases linearly); that result holds only for artificial covers

The square root law assumptions

[Diagram: the cover sequence i, j, k, l is a Markov chain (MC) with transition probabilities aij, ajk, akl; embedding changes each element independently with probabilities bii′, bjj′, bkk′, bll′, so the stego sequence i′, j′, k′, l′ is a hidden Markov chain (HMC).]

As n → ∞:

Conservative payload: M/√n → 0. The stegosystem is asymptotically perfectly secure (DKL → 0).

Unsafe payload: M/√n → ∞. An arbitrarily accurate detector can be built.

Borderline payload: M/√n → const. The stegosystem is ε-secure (DKL ≤ ε).

n … cover size (e.g., number of pixels); M … payload size (message length in bits)

The square root law theorem

n … number of cover elements
px … cover distribution
py … stego distribution with a fraction β of changed pixels
I(0) … Fisher information w.r.t. β at β = 0, I(0) = Ex[(∂ log py(β)/∂β)²]

DKL(px ‖ py) = ½ nβ² I(0) + O(β³) ≤ const.

Fixed risk ⇒ nβ² = const. ⇒ payload M ∝ nβ ∝ √n

Where does the SRL come from?
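The scaling argument can be checked numerically. This sketch assumes a hypothetical Fisher information I(0) and risk budget, and shows that holding DKL ≈ ½ nβ² I(0) fixed forces the payload M = nβ to grow only as √n (the root rate stays constant):

```python
import math

I0 = 2.0      # hypothetical Fisher information I(0) at beta = 0
risk = 0.01   # fixed KL-divergence budget

for n in (10_000, 100_000, 1_000_000):
    # DKL ~ 0.5 * n * beta**2 * I0 = risk pins down the change rate beta:
    beta = math.sqrt(2 * risk / (n * I0))
    M = n * beta                                       # payload ~ n * beta
    print(n, round(M, 1), round(M / math.sqrt(n), 3))  # root rate M/sqrt(n) is constant
```

A hundred-times bigger cover thus carries only a ten-times bigger secure payload under this model.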

Experimental verification

[Plot: secure payload M (bits) as a function of the cover size n]

• A ten-times bigger image can hold only about a three-times bigger payload at the same level of statistical detectability
- Alice and Bob, be conservative with your payload!

• It is easier to detect the same relative payload in larger images!
- be careful when comparing steganalysis results across different databases
- need to fix M/√n – the root rate

• Use the root rate for comparing/benchmarking/designing stegosystems

What does this mean for practitioners?

Options available:

• Force embedding to resist a given attack - necessary but not sufficient, usually rather weak

• Adopt a model (or its sampled version) and preserve it - unless the model is sufficiently complex, usually easy to detect

• Mimic natural processing / noise - very hard to do right (Franz, 2008)

• Minimize embedding distortion
- the best steganographic algorithms for digital images known today work precisely in this manner
- no need to estimate high-dimensional distributions
- can work with very high-dimensional models
- well developed; an entire framework is in place

So, how shall we embed in practice?

Historical development

d(.,.) = number of embedding changes
- matrix embedding (Crandall 1998, Westfeld 2001, …)

d(.,.) = L1 or L2 distortion
- MMEx (Kim & Duric & Richards 2006), Sachnev & Kim 2009, perturbed quantization (2004), …

general d(.,.)
- syndrome-trellis codes (Filler 2009–11, Pevný HUGO 2010)

Strategy
a) fix the embedding operation
b) define the embedding distortion function d(x,y)
c) embed by minimizing d(x,y) for a given payload and cover x
d) search for the d(.,.) that minimizes detectability in practice

Steganography minimizing distortion

• During embedding, cover x is slightly modified to stego object y that carries the desired payload

• For each x, the sender specifies a set Y of plausible stego images y ∈ Y

• LSB embedding: y ∈ {x, LSBflip(x)}
• ±k embedding: y ∈ {x, x+1, x−1, …, x+k, x−k}
• When x is not to be modified, y ∈ {x}

• Distortion measure d(x,y) is an arbitrary bounded function of x and y

• The embedding algorithm selects each y ∈ Y with probability π(y)

Expected payload: H(π)

Expected distortion: Eπ[d] = Σy π(y) d(x,y)

Formalization

Payload-limited sender

Minimize the expected distortion Eπ[d] = Σy π(y) d(x,y)
subject to the payload constraint H(π) ≥ M

Distortion-limited sender

Maximize the expected payload H(π)
subject to the distortion bound Eπ[d] = Σy π(y) d(x,y) ≤ D

The optimization is over all distributions π over Y.

It is π and Y that encapsulate the entire embedding algorithm.

They also depend on the cover x!

Searching for the best π

The optimal distribution (algorithm) for both senders is the Gibbs distribution

πλ(y) = (1/Z(λ)) exp(−λ d(x,y))

λ > 0 is a scalar parameter determined from the appropriate constraint (the “inverse temperature”)

Z(λ) = Σy exp(−λ d(x,y)) is the normalization factor (partition function)

We can import many useful algorithms from statistical physics into steganography

Gibbs distribution
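A minimal sketch of the payload-limited sender on a toy candidate set Y: compute the Gibbs distribution for a given λ, then find by bisection the λ that meets a payload constraint H(π) = M. The costs are hypothetical, chosen only to make the example concrete.

```python
import math

def gibbs(costs, lam):
    """pi_lambda(y) proportional to exp(-lambda * d(x, y)) over a small set Y."""
    w = [math.exp(-lam * d) for d in costs]
    Z = sum(w)                     # partition function Z(lambda)
    return [wi / Z for wi in w]

def entropy_bits(pi):
    return -sum(p * math.log2(p) for p in pi if p > 0)

def solve_lambda(costs, payload_bits, lo=1e-6, hi=50.0):
    """Bisection on lambda: H(pi_lambda) decreases as lambda grows."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if entropy_bits(gibbs(costs, mid)) > payload_bits:
            lo = mid               # entropy too high -> raise lambda
        else:
            hi = mid
    return (lo + hi) / 2

costs = [0.0, 1.0, 1.0, 2.0]       # hypothetical d(x, y) for four candidates in Y
lam = solve_lambda(costs, payload_bits=1.5)
pi = gibbs(costs, lam)
print(round(entropy_bits(pi), 3))  # payload constraint H(pi) = 1.5 bits is met
```

Real schemes never enumerate Y explicitly; this brute-force version only illustrates the distribution being targeted.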

• By sampling from πλ(y), one can simulate the impact of theoretically optimal embedding!

• Gibbs sampler, Markov chain Monte Carlo methods • Detectability can be tested before implementing the scheme

• Near-optimal practical embedding schemes can be constructed using syndrome-trellis codes
- when d(.,.) is a sum of locally-supported potential functions
- includes the most common case when d is additive (more on the next slide)

• We can compute the rate–distortion bound: the relationship between the expected distortion E[d] and the payload H(π)
- the bound tells us how close to optimum a given implementation is
- thermodynamic integration + Gibbs sampler

Separation principle

d(x,y) = Σk ρk(x, yk)

• ρk(.,.) are arbitrary bounded functions
• Embedding changes do not interact
• Sampling is easy, as each pixel can be changed independently of the other pixels (no need for MCMC samplers)
• Embedding with minimal d is source coding with a fidelity constraint (Shannon 1954)
• Near-optimal quantizers exist based on the Viterbi algorithm: syndrome-trellis codes (Filler 2009)

Examples
• ρk(x, yk) = 1 iff xk ≠ yk: d is the number of embedding changes
• ρk(x, yk) ∈ {1, ∞}: wet paper codes
• ρk(x, yk) = ρk for xk ≠ yk: changes have different weights
• yk ∈ M, |M| = m: m-ary embedding

Additive distortion
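For an additive distortion with binary changes, the Gibbs distribution factorizes into independent per-pixel flip probabilities, so the payload and the expected distortion are simple sums. A sketch with hypothetical per-pixel costs:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def flip_probs(rho, lam):
    """For additive d(x, y) = sum_k rho_k * [x_k != y_k], the optimal pi factorizes:
    pixel k is flipped independently with probability exp(-lam*rho_k)/(1+exp(-lam*rho_k))."""
    return [math.exp(-lam * r) / (1 + math.exp(-lam * r)) for r in rho]

rho = [1.0, 1.0, 4.0, 10.0]   # hypothetical per-pixel costs (large = risky to change)
lam = 1.0
p = flip_probs(rho, lam)

payload = sum(h2(pk) for pk in p)                    # H(pi) in bits
distortion = sum(pk * rk for pk, rk in zip(p, rho))  # E[d]
print(round(payload, 3), round(distortion, 3))
```

Note how the pixel with cost 10 is almost never changed: this is exactly the content-adaptive behavior the distortion function is designed to induce.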

• Represent the image using a feature vector f ∈ Rk
• f is an image model
• f can be high-dimensional, e.g., 10^6 or higher
• Define the distortion

d(x,y) = ‖f(x) − f(y)‖

• Minimizing d(.,.) ⇒ model preservation

Embedding while preserving model

HUGO (Highly Undetectable steGO) by Pevný, Filler, Bas (IH 2010, Calgary)
- uses a model of dimensionality 10^7
- steganalyzers of 2010 were unable to detect payloads ≤ 0.4 bpp

• Message hidden while minimizing an additive distortion function designed to correlate with statistical detectability

• Incorporates syndrome-trellis codes for near-optimal content-adaptive embedding

[Images: a cover image and the locations of HUGO’s actual embedding changes]

HUGO

• Cast as a supervised pattern-recognition problem (Avcibas, Memon 2001; Farid 2002)

• Represent images using features (dimensionality reduction)
- features sensitive to embedding but not to content
- often computed from the noise component
- sampled joint or conditional distributions (Shi & Chen & Chen 2006–2008)
- designed to capture dependencies among neighboring pixels

• Train a classifier on examples of cover/stego images - support vector machines, neural networks, ensemble classifiers - binary, one-class, or multi-class - regressors can estimate change rate

Modern steganalysis

• Rich cover model - union of many smaller diverse submodels - all dimensions well populated - diversity more important than dimensionality

• Scalable machine learning - ensemble classifier - fast w.r.t. dimensionality as well as training set size - allows using rich cover models

Steganalysis using rich models

• Define pixel predictors Pred(k)(x), k = 1, …, p, where Pred(k)(xij) is the value of xij predicted from a local neighborhood not containing xij

• Compute the kth noise residual: R(k) = Pred(k)(x) − x
• Quantize: R(k) ← round(R(k)/q), q = quantization step
• Truncate: if |R(k)| > T, set R(k) = T·sign(R(k))
• Form co-occurrence matrices of neighboring residual samples

Example horizontal predictors:
- horizontally constant model: Pred(xij) = xi,j+1
- horizontally linear model: Pred(xij) = (xi,j−1 + xi,j+1)/2
- horizontally quadratic model: Pred(xij) = (xi,j−1 + 3xi,j+1 − xi,j+2)/3

Other options include vertical models, predictors utilizing local 3×3 or 5×5 neighborhoods or their parts, non-linear operations min and max, lookup tables of conditional probabilities learned from cover-source images, etc.

Assembling the rich model (spatial domain)
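The residual pipeline can be sketched on a toy image: predict each pixel from its right neighbor (the horizontally constant model), quantize and truncate the residual, then collect a co-occurrence histogram of horizontally adjacent residuals. All pixel values here are illustrative.

```python
from collections import Counter

T, q = 2, 1.0   # truncation threshold and quantization step

x = [
    [12, 13, 13, 15],
    [12, 12, 14, 19],
    [11, 12, 13, 14],
]

def residual_row(row):
    # horizontally constant predictor: Pred(x_ij) = x_{i,j+1},
    # so the residual is R = Pred(x) - x = x_{i,j+1} - x_{i,j}
    r = [row[j + 1] - row[j] for j in range(len(row) - 1)]
    # quantize, then truncate into [-T, T]
    return [max(-T, min(T, round(v / q))) for v in r]

R = [residual_row(row) for row in x]

# second-order horizontal co-occurrence of neighboring residual samples
cooc = Counter((r[j], r[j + 1]) for r in R for j in range(len(r) - 1))
print(sorted(cooc.items()))
```

A real submodel would use higher-order co-occurrences over millions of pixels; truncation keeps the histogram support small so all bins stay well populated.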

Image with many edges

Edge close up

Adaptive schemes are likely to embed here even though the content is modelable in the vertical direction. With a simple predictor, the pixel differences will mostly fall into the marginal (truncation) bins; linear or quadratic models bring the residual back inside the co-occurrence matrix.

Complex models better adapt to content

[Diagram: a feature space of dimension N is split into random subspaces of dimension k << N; each subspace feeds one of the base learners B1, …, BL, and their decisions are combined by classifier fusion.]

Ensemble classifiers

• Ensemble classifier built as a random forest
• Decision obtained by fusing the decisions of L simple, diverse, and unstable base learners B1, …, BL (majority voting)
• Base learners = Fisher linear discriminants (analytic formula)
• FLDs trained on bootstrap samples of the training set (bagging) to increase diversity
• The parameters L (number of base learners) and k can be determined by a search using out-of-bag (OOB) error estimates
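An illustrative sketch of this construction on synthetic data (dimensions far smaller than real rich models, and all numbers hypothetical): Fisher linear discriminants trained on random subspaces and bootstrap samples, fused by majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, k = 40, 11, 8          # feature dim, number of base learners, subspace dim

# Synthetic "cover" (label 0) and "stego" (label 1) feature vectors.
X0 = rng.normal(0.0, 1.0, size=(200, N))
X1 = rng.normal(0.4, 1.0, size=(200, N))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]

learners = []
for _ in range(L):
    idx = rng.choice(N, size=k, replace=False)            # random subspace
    boot = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample (bagging)
    Xb, yb = X[boot][:, idx], y[boot]
    m0, m1 = Xb[yb == 0].mean(0), Xb[yb == 1].mean(0)
    Sw = np.cov(Xb[yb == 0].T) + np.cov(Xb[yb == 1].T)    # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(k), m1 - m0)   # FLD direction (analytic)
    b = w @ (m0 + m1) / 2                                 # threshold at class midpoint
    learners.append((idx, w, b))

def predict(Xq):
    votes = np.array([(Xq[:, idx] @ w > b).astype(int) for idx, w, b in learners])
    return (votes.mean(0) > 0.5).astype(int)              # majority vote

acc = (predict(X) == y).mean()
print(acc > 0.5)
```

In the real ensemble, L and k would be tuned with out-of-bag error estimates rather than fixed by hand, and accuracy would be measured on held-out images rather than the training set.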

Performance
- the ensemble outperforms linear SVMs
- comparable to Gaussian SVMs

Complexity as a function of the training set size Ntrn
Algorithm: nsF5; feature set: 548-dimensional CC-PEV

Complexity comparison with SVMs

Example I: Detecting HUGO

[Plot: detection performance vs. the number of pixel predictors. Feature sets compared (dimensionality): CDF set (1234), rich model with 2 predictors (330), 6 predictors (985), 20 predictors (3286), 78 predictors (12753).]

Image source: 9074 BOSSbase images, grayscale 512×512

Example II: Detecting ±1 embedding

[Plot: detection performance vs. the number of pixel predictors. Feature sets compared (dimensionality): CDF set (1234), rich model with 2 predictors (330), 6 predictors (985), 20 predictors (3286), 78 predictors (12753).]

Image source: 9074 BOSSbase images, grayscale 512×512

Steganography in empirical covers
• Reduced to finding a good distortion function
• For example, specify the cost of changes
• Available to you: near-optimal codes, the rate–distortion bound, simulators of optimal embedding

Steganalysis design
• Reduced to finding a sufficiently rich model
• Should be diverse
• High dimensionality is no longer a problem (ensemble classifiers)

Closing remarks

http://dde.binghamton.edu/download/

It is all about the model …

• Can we build a “complete” model of digital images?
- is it possible at all? (empirical sources are incognizable)
- if not, how far are we?

• Who is winning: Alice & Bob or the Warden?
- can we establish fundamental bounds using game theory?
- robust classification is needed for practice

• Applications beyond steganography
- forensics
- source coding

What is next?

“Do not skate to where the puck is. Skate to where the puck is gonna be.” – Wayne Gretzky
