Signal Processing: Image Communication 67 (2018) 132–139

Contents lists available at ScienceDirect

Signal Processing: Image Communication

journal homepage: www.elsevier.com/locate/image

Paired mini-batch training: A new deep network training for image forensics and steganalysis

Jin-Seok Park, Hyeon-Gi Kim, Do-Guk Kim, In-Jae Yu, Heung-Kyu Lee *

School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea

ARTICLE INFO

Keywords:
Deep learning
Deep convolutional neural networks
Mini-batch
Image forensics
Steganalysis

ABSTRACT

Deep convolutional neural networks (convnets) have recently become popular in many research areas because convnets can extract features automatically and classify them with high accuracy. Researchers in the image forensics and steganalysis field have proposed methods using convnets to develop technologies that work in practical environments. However, they found that the convnets used for computer vision were not suitable for image forensics and steganalysis because these convnets tend to learn features that represent the contents of images rather than forensic or steganalysis features. To overcome this limitation, researchers have proposed various structures, but there are no studies that take into account other factors related to training neural networks for image forensics and steganalysis. In this paper, we clearly represent the training process for image forensics and steganalysis using a training equation and explain why training convnets with the standard mini-batch is inefficient for image forensics and steganalysis. We then propose a new mini-batch, called the paired mini-batch, which is better suited for image forensics and steganalysis.

1. Introduction

Deep convolutional neural networks (convnets) have demonstrated excellent performance in applications for various types of computer vision. Since Krizhevsky et al. reported that convnets could classify images into various categories with high accuracy [1], many computer vision methods using convnets have been proposed: for example, face detection [2], pedestrian detection [3], saliency detection [4], super-resolution of images [5], video classification [6], etc. The methods proposed using convnets are simpler and perform better than existing hand-crafted methods.

Recently, convnets have been employed for computer vision as well as in many other research areas because they can generate features automatically and distinguish different categories with high accuracy even in complex environments. In speech recognition [7], natural language processing [8,9], and other research fields [10,11], several technologies have already demonstrated good performance using convnets, and researchers in the image forensics and steganalysis fields have also developed technologies using convnets.

Both image forensics and steganalysis involve the classification of normal and manipulated images. Image forensics [12,13] is aimed at determining whether an image is genuine or forged, which means checking whether the image was directly captured with a camera or manipulated using image editing programs such as Photoshop. Steganalysis [14] is aimed at determining whether an image contains secret messages. The results of steganalysis can be classified into two types: normal or stego, which indicates the presence of secret messages. Fig. 1 shows examples of normal and manipulated images generated through image manipulation and steganography, as well as the differences between the two types of images. These image types contain slight differences, which should be classified through image forensics and steganalysis.

* Corresponding author.
E-mail address: [email protected] (H.-K. Lee).

Previous image forensics and steganalysis methods relied on hand-crafted features resulting from the process of image acquisition with a camera or from the image editing process. These features include photo response non-uniformity [15], color filter array patterns [16], meaningful noise [17], discrete cosine transform coefficients [18], etc. Existing methods show good performance in specific environments, but the results are poor in practical situations because the real world contains many types of forged images (or steganography), created with different image compression methods, each with different properties. To overcome this limitation, researchers have proposed methods using convnets.

However, image forensics and steganalysis did not perform well with the same convnets used in computer vision [19,20] because both fields have different requirements compared to computer vision. Consequently, researchers have proposed new convnet structures for

https://doi.org/10.1016/j.image.2018.04.015
Received 30 November 2017; Received in revised form 2 March 2018; Accepted 29 April 2018
Available online 15 June 2018
0923-5965/© 2018 Elsevier B.V. All rights reserved.


Fig. 1. The left three images are normal images, and the middle three images are manipulated images with additive white Gaussian noise (𝜎 = 1), median filtering, and S-UNIWARD, respectively. The right three images represent the difference between the normal image and the manipulated image, where white pixels represent +5 and black pixels represent −5.

image forensics and steganalysis. Researchers in steganalysis found that the average pooling layer is better than the max pooling layer [21] for steganalysis. In addition, in both research fields, inserting the filtered residual image into convnets yields better results than inserting the full image block [19–22].

Although there are studies that propose convnet structures for image forensics and steganalysis, none consider other aspects of training neural networks for image forensics and steganalysis. In this paper, we clarify how image forensics and steganalysis differ from computer vision when using convnets. In consideration of the different features, we propose a new mini-batch configuration method named the paired mini-batch for image forensics and steganalysis.

The rest of this paper is organized as follows. Section 2 introduces related works on image forensics and steganalysis that proposed convnet structures. Section 3 explains the training process for convnets using paired mini-batches and theoretically demonstrates how the training efficiency varies depending on the mini-batch used. Section 4 describes the efficiency of the proposed paired mini-batch training method through various experiments. The conclusion is presented in Section 5.

2. Related works

In this section, we discuss existing works related to image forensics and steganalysis using convnets. All prior work focused on the structure of convnets but did not consider other training properties.

2.1. Image forensics using convnets

The detection of median filtering is important in image forensics because a median filter inconspicuously removes the traces of manipulation. In [19], Chen et al. first applied convnets to image forensics and subsequently developed a method to determine which images were median filtered. Remarkably, they found that median filtering detection did not perform well with the conventional convnets used in computer vision. They concluded that the fingerprint left by median filtering is heavily affected by image edges and textures, and proposed inserting the median filtering residual (MFR) noise rather than the image itself into the convnets.

Bayar and Stamm proposed a new convolutional layer for image forensics [22]. They posited that the standard convnets used in computer vision were specialized for object recognition rather than image forensics because the convnets tended to learn features that represent the contents of images rather than manipulation features. For this reason, they designed a new convolutional layer similar to a high-pass filter to focus on forensic features.

2.2. Steganalysis using convnets

Qian et al. first demonstrated that convnets could be used for universal steganalysis [20]. They attached a high-pass filter, the KV filter, in front of the convnets to make the stego signal stronger and the image content signal weaker. They also insisted that a Gaussian activation function is better than rectified linear unit (ReLU) activation and that the average pooling layer is suitable for steganalysis.

In [21], Pibre et al. performed various experiments to identify the convnet structure most appropriate for steganalysis. They designed a total of 40 networks and performed comparative experiments. The results showed that the average pooling layer is better than the max pooling layer and that the ReLU activation function performs better than the Gaussian activation function, which contradicts the claims of [20].

3. Paired mini-batch training

All related studies determined that the convnets used for computer vision are not suitable for image forensics and steganalysis because these convnets tend to learn features that represent the contents of images rather than forensic (or steganalysis) features. In Section 3, we describe this phenomenon with an equation and propose a new mini-batch, called the paired mini-batch, to overcome the limitations of prior training methods.

3.1. Gradient descent optimization

Gradient descent optimization is a representative method for training neural networks. In neural network training, the weights of the networks change according to gradient descent optimization in the direction of decreasing loss value. To change the weights, the loss value must be calculated first, which is achieved through a feed-forward process: first, the output of the networks is calculated using the input data and network weights; next, the loss value is calculated by comparing the output with the label of the input data. Depending on how much data is used in one training step, the training method can be divided into three categories: stochastic, mini-batch, and full-batch gradient descent optimization [23,24].

Stochastic gradient descent optimization is performed by modifying the weights of the networks using only one data sample. Let the input be 𝑥, the label 𝑦, and the weights of the networks 𝛩, and let 𝑓 be the function that calculates the loss 𝐿 through the feed-forward process. The process of training neural networks using stochastic gradient descent can then be expressed as follows:

𝐿 = 𝑓(𝑥, 𝑦; 𝛩)   (1)

𝜃_{𝑗+1} ← 𝜃_𝑗 − 𝛼 𝜕𝐿∕𝜕𝜃_𝑗 , ∀𝜃   (2)

where 𝛼 is the learning rate and 𝑗 is the number of training iterations. Stochastic gradient descent optimization has the disadvantage that it does not reflect the characteristics of the entire dataset because it trains using only one randomly chosen data sample.
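As a concrete illustration of (1)–(2), the sketch below performs single-sample stochastic gradient descent updates in NumPy; the linear model and squared loss are illustrative choices made here, not those used in the paper.

```python
import numpy as np

def sgd_step(theta, x, y, alpha):
    """One stochastic gradient descent step, Eq. (2), for a linear model
    with squared loss L = 0.5 * (theta.x - y)^2 (illustrative choice)."""
    pred = theta @ x             # feed-forward: compute the network output
    grad = (pred - y) * x        # dL/dtheta for the squared loss
    return theta - alpha * grad  # update rule of Eq. (2)

theta = np.zeros(3)
x, y = np.array([1.0, 2.0, 0.5]), 1.0
for _ in range(200):
    theta = sgd_step(theta, x, y, alpha=0.1)
# after enough steps the prediction approaches the label
print(round(float(theta @ x), 3))  # → 1.0
```

Each update uses a single (x, y) pair, which is exactly why, as the text notes, one step need not reflect the whole dataset.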

Full-batch and mini-batch gradient descent optimization use multiple data samples for one training step. The entire training dataset is used for full-batch gradient descent optimization, whereas a subset of the training data is used for mini-batch gradient descent optimization. When the number of data samples used for learning is 𝑘, the neural networks are trained with the average derivative of the loss calculated from the 𝑘 samples.

Fig. 2. The standard mini-batch (left) consists of data extracted randomly from all data, and the proposed paired mini-batch (right) consists of several data extracted in pairs from all data.

𝜃_{𝑗+1} ← 𝜃_𝑗 − 𝛼 (1∕𝑘) Σ_{𝑖=1}^{𝑘} 𝜕𝐿_𝑖∕𝜕𝜃_𝑗 , ∀𝜃   (3)

If 𝑈 is a function that updates the weights of the neural networks and 𝐵 is a mini-batch, then (3) can be expressed as below:

𝛩_{𝑗+1} = 𝑈(𝛩_𝑗 , 𝐵)   (4)

If the weights of the networks are updated by full-batch gradient descent optimization, the change will reflect the characteristics of the entire dataset. However, full-batch gradient descent optimization is not practical because it requires extensive memory. For these reasons, mini-batch gradient descent optimization is usually used to train neural networks.

Because a mini-batch is selected by random sampling from the whole dataset, it maintains most of the characteristics of the entire data; however, the training results can differ according to how the mini-batches are constructed from the whole dataset.

𝑈(𝑈(𝛩_𝑗 , 𝐵_1), 𝐵_2) ≠ 𝑈(𝑈(𝛩_𝑗 , 𝐵_3), 𝐵_4)   (5)

where {𝐵_1, 𝐵_2} and {𝐵_3, 𝐵_4} are two mini-batch groups selected as different combinations from the same entire dataset.

3.2. Limitation of the standard mini-batch

Generally, the mini-batch is selected through random sampling; we call this mini-batch the standard mini-batch in this paper. We found that the standard mini-batch is not efficient for training convnets in image forensics and steganalysis because these fields have different requirements compared to computer vision:

∙ Image forensics and steganalysis distinguish between normal and manipulated images.

∙ Differences between normal and manipulated images are not visually detectable but are hidden within the images.

∙ Training image data for the two types of images are always in pairs.

Fig. 2(a) shows an example of selecting a standard mini-batch using random sampling. When using the standard mini-batch in Fig. 2(a), the losses and changes in weights are represented as follows:

𝐿_1 = {𝑓(𝐴, 𝑁; 𝛩), 𝑓(𝐶, 𝑁; 𝛩), 𝑓(𝐶∗, 𝑀; 𝛩), 𝑓(𝐷∗, 𝑀; 𝛩)}   (6)

𝜃_{𝑗+1} ← 𝜃_𝑗 − 𝛼 (1∕4) Σ_{𝐿∈𝐿_1} 𝜕𝐿∕𝜕𝜃_𝑗 , ∀𝜃   (7)

where the label for normal images is 𝑁 and the label for manipulated images is 𝑀.

To analyze the change in weights, (7) can be rewritten as below:

𝛥𝜃_𝑗 = −(𝛼∕4)(𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐷∗}∕𝜕𝜃_𝑗)   (8)

𝛥𝜃_𝑗 = −(𝛼∕2)((1∕2)(𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗) + (1∕2)(𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐷∗}∕𝜕𝜃_𝑗))   (9)

During one training step, the weights of the convnets are updated as in (9). The change in weights can be represented in two parts: 𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗 and 𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐷∗}∕𝜕𝜃_𝑗. In the former case, the change in weights is calculated from 𝐿_𝐶 and 𝐿_{𝐶∗}. The two losses are calculated from different categories, 𝐿_𝐶 with label 𝑁 and 𝐿_{𝐶∗} with label 𝑀, but both are calculated from the same content. This means that the change in weights turns toward the direction of classifying the slight manipulation difference between the normal and manipulated image.

In the latter case, however, the weights are changed using 𝐿_𝐴 and 𝐿_{𝐷∗}. These two losses are also calculated from different categories, but they are calculated from two different contents. Because the difference in image content is greater than the difference caused by image manipulation, the change in weights using these two losses turns toward the direction of classifying different contents. This phenomenon prevents convnets from distinguishing between normal and manipulated images in image forensics and steganalysis.
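This content-versus-manipulation effect can be checked numerically. The toy sketch below uses a linear score with logistic loss as an illustrative stand-in for a convnet (all names and dimensions are assumptions made here): for a matched pair of the same content with opposite labels, the content term cancels and the summed gradient is dominated by the small manipulation 𝛿, while an unmatched pair leaves a large content component.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_logistic(w, x, y):
    """Gradient of L = log(1 + exp(-y * w.x)) w.r.t. w, with labels y in {+1, -1}."""
    s = w @ x
    return -y * x / (1.0 + np.exp(y * s))

content_a = rng.normal(0.0, 10.0, size=100)  # strong image-content signal
content_b = rng.normal(0.0, 10.0, size=100)  # a different image's content
delta = rng.normal(0.0, 0.1, size=100)       # subtle manipulation signal
w = np.zeros(100)

# matched pair: normal and manipulated version of the SAME content
g_pair = grad_logistic(w, content_a, +1) + grad_logistic(w, content_a + delta, -1)
# unmatched pair: normal and manipulated images of DIFFERENT contents
g_unpair = grad_logistic(w, content_a, +1) + grad_logistic(w, content_b + delta, -1)

print(np.linalg.norm(g_pair))    # small: content cancels, only delta remains
print(np.linalg.norm(g_unpair))  # large: dominated by the content difference
```

At w = 0 the matched-pair gradient reduces exactly to 𝛿∕2, mirroring the argument above that matched losses steer the weights toward the manipulation difference rather than the content difference.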

3.3. Proposed paired mini-batch

To overcome the limitation of using the standard mini-batch, we propose a new mini-batch called the paired mini-batch, which is created by selecting the two categories of data in pairs. Creating a paired mini-batch is simple, but it makes training convnets faster and more accurate in image forensics and steganalysis.

Fig. 2(b) shows an example of selecting a paired mini-batch, and the losses and changes in weights are represented as follows:

𝐿_2 = {𝑓(𝐴, 𝑁; 𝛩), 𝑓(𝐶, 𝑁; 𝛩), 𝑓(𝐴∗, 𝑀; 𝛩), 𝑓(𝐶∗, 𝑀; 𝛩)}   (10)

𝜃_{𝑗+1} ← 𝜃_𝑗 − 𝛼 (1∕4) Σ_{𝐿∈𝐿_2} 𝜕𝐿∕𝜕𝜃_𝑗 , ∀𝜃   (11)

To analyze the change in weights, (11) can be rewritten as below:

𝛥𝜃_𝑗 = −(𝛼∕4)(𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐴∗}∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗)   (12)

𝛥𝜃_𝑗 = −(𝛼∕2)((1∕2)(𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐴∗}∕𝜕𝜃_𝑗) + (1∕2)(𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗))   (13)

In one training step, the weights of the convnets change as shown in (13), and the change in weights can be considered as two parts: 𝜕𝐿_𝐴∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐴∗}∕𝜕𝜃_𝑗 and 𝜕𝐿_𝐶∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝐶∗}∕𝜕𝜃_𝑗. In contrast to (9), the losses in both parts are generated from the same content. This means that the change in weights turns toward the direction of the average of the two manipulations.

To analyze the difference between using the standard and the paired mini-batch in a general situation, we define 𝛥𝐷 and 𝛥𝑃 as below:

𝛥𝐷 = 𝜕𝐿_𝑋∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝑌∗}∕𝜕𝜃_𝑗   (14)

𝛥𝑃 = 𝜕𝐿_𝑋∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝑋∗}∕𝜕𝜃_𝑗   (15)

𝛥𝑅 = 𝛥𝐷 or 𝛥𝑃   (16)

𝛥𝐷 denotes the change of weights when a manipulated image has different content from the normal image, 𝛥𝑃 denotes the change of weights when a manipulated image has the same content as the normal image, and 𝛥𝑅 denotes a change of weights that can be either 𝛥𝐷 or 𝛥𝑃.

As shown in (9) and (13), 𝛥𝐷 turns toward the direction of classifying different contents rather than classifying the subtle difference between normal and manipulated images, and this phenomenon becomes greater as the learning rate 𝛼 becomes higher. On the other hand, 𝛥𝑃 turns toward the direction of classifying only the subtle difference between the normal and manipulated images. For this reason, the larger the number of 𝛥𝑃 terms in a mini-batch, the better the convnets are trained to distinguish normal and manipulated images.

If the batch size is 𝑘 = 2𝑚 and the batch has an equal number of normal and manipulated images, then the losses are as follows:

𝐿 = {𝐿_1, 𝐿_{2∗}, … , 𝐿_{2𝑚−1}, 𝐿_{2𝑚∗}}   (17)

The convnets are trained with (17), and 𝛥𝜃_𝑗 can be expressed as (18)–(19).

𝛥𝜃_𝑗 = −(𝛼∕𝑚)((1∕2)(𝜕𝐿_1∕𝜕𝜃_𝑗 + 𝜕𝐿_{2∗}∕𝜕𝜃_𝑗) + ⋯ + (1∕2)(𝜕𝐿_{2𝑚−1}∕𝜕𝜃_𝑗 + 𝜕𝐿_{2𝑚∗}∕𝜕𝜃_𝑗))   (18)

𝛥𝜃_𝑗 = −(𝛼∕2𝑚)(𝛥𝑅_1 + ⋯ + 𝛥𝑅_𝑚)   (19)

When the mini-batch is selected by random sampling, some normal and manipulated images are in pairs, but most normal and manipulated images exist alone; therefore, (19) can be rewritten as (20).

𝛥𝜃_𝑗 = −(𝛼∕2𝑚)((𝛥𝑃_1 + ⋯ + 𝛥𝑃_𝑛) + (𝛥𝐷_{𝑛+1} + ⋯ + 𝛥𝐷_𝑚))   (20)

The number of 𝛥𝑃 terms is determined by the following probability:

Pr(𝑛 = 𝑖) = 𝐶(𝑚, 𝑖) · 𝑃(𝑁 − 𝑚, 𝑚 − 𝑖) · 𝑃(𝑚, 𝑖) ∕ 𝑃(𝑁, 𝑚)   (21)

where 𝐶(·, ·) and 𝑃(·, ·) denote the number of combinations and permutations, respectively.

That is, as the total data size 𝑁 becomes relatively larger than the batch size 𝑚, the probability that the number of 𝛥𝑃 terms is small increases, which causes learning inefficiency.
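Evaluating (21) directly with combination and permutation counts makes this concrete; the script below is a sketch with illustrative values of 𝑁 and 𝑚 (the function name and parameters are assumptions made here). The probabilities sum to one, and the expected number of pairs in a randomly sampled batch stays small when 𝑁 ≫ 𝑚.

```python
from math import comb, perm

def pr_pairs(N, m, i):
    """Pr(n = i) from Eq. (21): probability that exactly i of the m sampled
    manipulated images form content pairs with the m sampled normal images."""
    return comb(m, i) * perm(N - m, m - i) * perm(m, i) / perm(N, m)

N, m = 1000, 32  # illustrative dataset size and half-batch size
dist = [pr_pairs(N, m, i) for i in range(m + 1)]

print(sum(dist))                               # distribution sums to 1
print(sum(i * p for i, p in enumerate(dist)))  # expected pairs is small (~m*m/N)
```

With these illustrative numbers, a random batch of 64 images drawn from 2000 contains only about one matched pair on average, so almost every term in (20) is an inefficient 𝛥𝐷.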

On the other hand, if a mini-batch is constructed using the paired mini-batch method, the change in weights is as shown in (22)–(24), and the weights of the convnets are updated in the direction of the average of several subtle manipulations.

𝐿 = {𝐿_1, 𝐿_{1∗}, … , 𝐿_𝑚, 𝐿_{𝑚∗}}   (22)

𝛥𝜃_𝑗 = −(𝛼∕𝑚)((1∕2)(𝜕𝐿_1∕𝜕𝜃_𝑗 + 𝜕𝐿_{1∗}∕𝜕𝜃_𝑗) + ⋯ + (1∕2)(𝜕𝐿_𝑚∕𝜕𝜃_𝑗 + 𝜕𝐿_{𝑚∗}∕𝜕𝜃_𝑗))   (23)

𝛥𝜃_𝑗 = −(𝛼∕2𝑚)(𝛥𝑃_1 + ⋯ + 𝛥𝑃_𝑚)   (24)

This change in weights is independent of the convnet structure, which means that the paired mini-batch can be applied to any convnets.

When there are 𝑁 normal images and the corresponding 𝑁 manipulated images in the dataset, the network training for one Epoch is performed by the step-by-step process below.

Algorithm 1: Training process using the paired mini-batch
Input: Network with trainable parameters 𝛩,
       Normal training data 𝑋_𝑁 = {𝑥_𝑁^1, ..., 𝑥_𝑁^𝑚},
       Number of Epochs 𝑁_𝐸, Batch size 𝑘, Learning rate 𝛼
1   for 𝑗 = 1 to 𝑚 do
2       𝑥_𝑀^𝑗 = ImageManipulation(𝑥_𝑁^𝑗) or Steganography(𝑥_𝑁^𝑗)
3   end
4   𝑋_𝑀 = {𝑥_𝑀^1, ..., 𝑥_𝑀^𝑚}
5   𝑛 = 2𝑚∕𝑘
6   for 𝑖 = 1 to 𝑁_𝐸 do
7       [𝑋_𝑁𝑆, 𝑋_𝑀𝑆] ← Permute([𝑋_𝑁, 𝑋_𝑀])
8       for 𝑗 = 0 to 𝑛−1 do
9           𝑋_𝑁𝐵 ← 𝑋_𝑁𝑆[(𝑗·𝑘∕2) + 1 : (𝑗·𝑘∕2) + 𝑘∕2]
10          𝑋_𝑀𝐵 ← 𝑋_𝑀𝑆[(𝑗·𝑘∕2) + 1 : (𝑗·𝑘∕2) + 𝑘∕2]
11          𝑋_𝐵 ← Merge(𝑋_𝑁𝐵, 𝑋_𝑀𝐵)
12          𝐿 = 𝑓(𝑋_𝐵; 𝛩_𝑗)
13          𝛩_{𝑗+1} ← 𝛩_𝑗 − 𝛼∇𝐿(𝛩_𝑗)
14      end
15  end

Step 1: Set the training dataset with 𝑁 normal images and 𝑁 manipulated images.
Step 2: Select 𝑚 normal images from the dataset by random sampling.
Step 3: Select the 𝑚 manipulated images which have the same contents as the selected normal images.
Step 4: Construct a mini-batch of 2𝑚 images with the 𝑚 selected normal images and 𝑚 manipulated images.
Step 5: Calculate the loss by the feed-forward process with the constructed mini-batch and update the weights of the neural networks with the calculated loss.
Step 6: Remove the used normal and manipulated images from the dataset.
Step 7: Go back to Step 2 if the dataset is not empty.

Algorithm 1 describes the full process from creating a paired mini-batch to training the convnets for 𝑁_𝐸 Epochs.
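The batching part of Algorithm 1 (lines 5–11) can be sketched as follows; `paired_minibatches` is an illustrative name introduced here, the toy arrays stand in for images, and concatenation plays the role of the Merge step.

```python
import numpy as np

def paired_minibatches(x_normal, x_manip, k, rng):
    """Yield paired mini-batches of size k: k/2 normal images and the k/2
    manipulated images with the SAME contents (Algorithm 1, lines 5-11)."""
    m = len(x_normal)
    assert len(x_manip) == m and k % 2 == 0
    order = rng.permutation(m)  # line 7: shuffle, keeping pairs aligned
    for j in range(0, m, k // 2):
        idx = order[j:j + k // 2]
        batch = np.concatenate([x_normal[idx], x_manip[idx]])   # Merge step
        labels = np.concatenate([np.zeros(len(idx)), np.ones(len(idx))])
        yield batch, labels

# toy data: 8 "images" of 4 pixels; manipulated = normal + small perturbation
rng = np.random.default_rng(0)
x_n = rng.normal(size=(8, 4))
x_m = x_n + 0.01 * rng.normal(size=(8, 4))

for batch, labels in paired_minibatches(x_n, x_m, k=4, rng=rng):
    print(batch.shape, labels)  # each batch: 2 normal + their 2 manipulated pairs
```

Because normal and manipulated samples share the same permuted indices, every batch contributes only 𝛥𝑃-type updates, which is the property Eq. (24) relies on.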

4. Experiments and analysis

In this section, we compare the test accuracy according to the number of iterations using the standard mini-batch and the proposed paired mini-batch. Experiments were conducted on three different convnet structures. All experiments were conducted using TensorFlow 0.12.0 on Ubuntu 16.04, and a GeForce GTX 1080 was used to train the convnets.

4.1. Detecting manipulated images using standard convnet

First, we conducted experiments to detect manipulated images using a standard convnet in image forensics. The standard convnet was the type used in image recognition, that is, a network without a high-pass filter. The convnet was designed to be similar to the VGG nets [25] and had eight convolutional layers, four max pooling layers, and three fully connected layers. ReLU was used for the activation function, and the learning rate and momentum were set to 1e-05 and 0.9, respectively. The convnets were trained for 20 Epochs, which lasted approximately 24 h.

Training and test images were generated by cropping images of the Bossbase dataset [26], which was captured with eight camera models. The images were cropped to 256 × 256 pixel gray images, and a total of 200,000 normal images were generated. Manipulated images were generated using the following three methods, producing a total of 600,000 manipulated images (three sets of 200,000).

∙ Median filtering with a 3 × 3 kernel

∙ Additive white Gaussian noise (AWGN) with a standard deviation 𝜎 = 1

∙ Gaussian blurring with a 3 × 3 kernel and a standard deviation 𝜎 = 0.4

Fig. 3. The graphs show the change in training loss (left) and test accuracy (right) for each experiment. The two top graphs show the results for distinguishing the normal and Gaussian noise (𝜎 = 1) images, and the two bottom graphs show the results for distinguishing the median filtered images using the standard convnet.

Table 1. The maximum accuracy when using the standard mini-batch (STD) and the paired mini-batch (PAIR) to distinguish median filtering (Median), Gaussian blur filtering (Blur), and additive white Gaussian noise (AWGN) using the standard convnet.
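The three manipulations above can be sketched with plain NumPy as a simplified stand-in for the paper's data pipeline; the naive 3 × 3 median filter, the explicit Gaussian kernel, and the uniform-noise stand-in for a cropped Bossbase image are all illustrative choices made here.

```python
import numpy as np

def median3x3(img):
    """Naive 3x3 median filter (edges handled by reflection padding)."""
    p = np.pad(img, 1, mode="reflect")
    shifted = [p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)]
    return np.median(np.stack(shifted), axis=0)

def awgn(img, sigma, rng):
    """Additive white Gaussian noise with standard deviation sigma."""
    return img + rng.normal(0.0, sigma, img.shape)

def gaussian_blur3x3(img, sigma=0.4):
    """3x3 Gaussian blur via an explicit normalized kernel."""
    ax = np.array([-1.0, 0.0, 1.0])
    k1 = np.exp(-ax ** 2 / (2 * sigma ** 2))
    kernel = np.outer(k1, k1)
    kernel /= kernel.sum()
    p = np.pad(img, 1, mode="reflect")
    out = np.zeros_like(img)
    for i in range(3):        # symmetric kernel, so shifted sums equal convolution
        for j in range(3):
            out += kernel[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

rng = np.random.default_rng(0)
normal = rng.uniform(0, 255, (256, 256))  # stand-in for a cropped gray image
manipulated = [median3x3(normal), awgn(normal, 1.0, rng), gaussian_blur3x3(normal)]
print([m.shape for m in manipulated])
```

Each manipulated image keeps the content of its normal counterpart, which is what makes the pairwise batch construction of Section 3.3 possible.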

Fig. 3 shows the change in training loss (left) and test accuracy (right) when the convnet was trained to distinguish between the normal and manipulated images. The two top graphs show the results for distinguishing additive white Gaussian noise images, and these results are very meaningful. As previous researchers reported, convnets without a high-pass filter could not adequately distinguish manipulated images when using a standard mini-batch (circle). On the other hand, when using the proposed paired mini-batch (square), the loss was reduced to about 0.2 and the maximum accuracy recorded was 0.9, despite using the convnet without a high-pass filter.

These results mean that we can train a convnet to distinguish normal images and manipulated images if we use the paired mini-batch, even when it is impossible to train using the standard mini-batch.

The two bottom graphs show the results for median filtering detection. This experiment showed the smallest improvement among all experiments: the difference in the final accuracy was only 0.0031. Although the difference was small, the average accuracy using the paired mini-batch was always better than that using the standard mini-batch. The number of training iterations required to reach an accuracy of 0.9 when using the paired mini-batch was 71.83% of the number of iterations required for the standard mini-batch.


Table 2. The maximum accuracy when using the standard mini-batch (STD) and the paired mini-batch (PAIR) to distinguish median filtering (Median), Gaussian blur filtering (Blur), and additive white Gaussian noise (AWGN) of various parameters using the standard convnet.

Table 1 shows the maximum accuracy according to the number of training Epochs. The accuracy when using the paired mini-batch was always higher than that using the standard mini-batch. The difference in accuracy between the paired and standard mini-batches was large for AWGN and blur detection, whereas the difference for median filtering detection was small. As mentioned in Section 3, when using the standard mini-batch, convnets tend to learn features related to the contents of the images rather than the manipulation features. This tendency becomes more apparent as the difference between the normal and manipulated images decreases. The l2-norm distance over all pixels between the normal images and the images manipulated using AWGN, median filtering, and blurring was 0.8559, 1.6226, and 0.7210, respectively.

4.2. Detecting manipulated images using convnet with a Bayar layer

The second experiments were performed to detect manipulated images using a convnet with a Bayar layer. The structure of the convnet was the same as that used in the first experiment, with the addition of a Bayar layer. The learning rate was set to 5e-05, and the other training parameters were the same as in the first experiment.

Manipulated images were made by median filtering, Gaussian blurring, and additive white Gaussian noise as in experiment 1, and various parameters were used. A total of three sigma parameters were used for making Gaussian blur images, and four different compression states were used for making AWGN images. A total of 1,600,000 manipulated images (eight sets of 200,000) were generated, and a total of 800,000 normal images (four sets of 200,000) were generated according to the compression parameters.

∙ Median filtering with a 3 × 3 kernel∙ Gaussian blurring with a 3 × 3 kernel and a standard deviation𝜎 = 0.3, 0.35, 0.4

∙ Additive white Gaussian noise (AWGN) with a standard deviation 𝜎 = 1 in RAW and JPEG compression with quality Q90, Q93, and Q96.
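The three manipulations listed above can be sketched as follows, assuming `scipy` for the filters; the JPEG re-compression step applied to the AWGN images in the paper is omitted in this sketch:

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def make_manipulated(img, rng=None):
    """Produce the three manipulations used in the experiments.

    img: float array (H, W) with pixel values in [0, 255].
    JPEG re-compression of the AWGN images (qualities Q90/Q93/Q96)
    is applied after the noise in the paper and is omitted here.
    """
    rng = rng or np.random.default_rng(0)
    return {
        'median': median_filter(img, size=3),             # 3x3 median kernel
        'blur':   gaussian_filter(img, sigma=0.4),        # sigma in {0.3, 0.35, 0.4}
        'awgn':   img + rng.normal(0.0, 1.0, img.shape),  # sigma = 1, RAW
    }
```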

Table 2 shows the maximum accuracy according to the number of training epochs. Compared with the first experiment, training with both the standard and paired mini-batches was faster because the Bayar layer generated forensic features directly. In the Median, Blur (𝜎 = 0.4), and AWGN (𝜎 = 1, RAW) experiments, the accuracy difference was reduced compared to experiment 1, because the signal for the image contents was weakened while the signal for the forensic features was enhanced by the Bayar filter.
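The Bayar layer mentioned above constrains its convolution kernels so that they compute prediction-error residuals rather than content features [22]. A minimal sketch of that constraint applied to a single kernel follows; `bayar_constrain` is our name for the helper, not the paper's:

```python
import numpy as np

def bayar_constrain(kernel):
    """Project a conv kernel onto the Bayar constraint [22]:
    center weight fixed to -1, remaining weights rescaled to sum to 1.
    The layer then predicts each pixel from its neighbors and subtracts
    it, suppressing image content. Re-applied after every gradient step.
    """
    k = kernel.astype(np.float64).copy()
    center = tuple(s // 2 for s in k.shape)
    k[center] = 0.0
    k /= k.sum()        # off-center weights sum to 1
    k[center] = -1.0    # center subtracts the predicted pixel
    return k
```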

In the Gaussian blur experiment, the sigma value was progressively reduced. A small sigma value means that the changes to the images are subtle, and we found that the smaller the sigma value, the greater the difference between training with the standard mini-batch and with the paired mini-batch. The average difference was 0.057 at 𝜎 = 0.4, 0.098 at 𝜎 = 0.35, and 0.121 at 𝜎 = 0.3. This is because the inefficiency of the standard mini-batch increases as the manipulation signal becomes smaller relative to the image content signal.
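The contrast between the two batch-construction strategies can be sketched as follows. This is an illustration of the idea only, and the paper's exact sampler may differ:

```python
import random

def standard_minibatch(normals, manipulated, batch_size, rng=None):
    """Standard mini-batch: covers (label 0) and manipulated images
    (label 1) are drawn independently from the whole training pool."""
    rng = rng or random.Random(0)
    pool = [(x, 0) for x in normals] + [(y, 1) for y in manipulated]
    return rng.sample(pool, batch_size)

def paired_minibatch(normals, manipulated, batch_size, rng=None):
    """Paired mini-batch: every sampled cover appears in the batch
    together with its own manipulated counterpart, so the image-content
    signal is identical across the two labels within the batch and the
    gradient is dominated by the manipulation signal."""
    rng = rng or random.Random(0)
    batch = []
    for i in rng.sample(range(len(normals)), batch_size // 2):
        batch.append((normals[i], 0))      # cover image
        batch.append((manipulated[i], 1))  # its manipulated pair
    return batch
```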

The bottom of Table 2 shows the results for distinguishing the normal images from the manipulated images (additive white Gaussian noise) in the RAW and JPEG-compressed images. JPEG compression was performed after the noise was added to the images, and the normal images were compressed at the same quality as the manipulated images.

In contrast to the Gaussian blur experiment, in the AWGN experiment the difference in accuracy was small for all of the JPEG compression qualities. Presumably, when JPEG compression is performed, the fine noise signal, which is a high-frequency component, is removed, so both the standard mini-batch and the paired mini-batch have great difficulty training the networks. Nevertheless, the average accuracy when using the paired mini-batch was always greater than when using the standard mini-batch for all RAW and JPEG-compressed images.
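The noise-then-compression order described above can be sketched as follows, assuming Pillow for the JPEG coding; `awgn_then_jpeg` is a hypothetical helper name:

```python
import io
import numpy as np
from PIL import Image

def awgn_then_jpeg(img, sigma=1.0, quality=90, rng=None):
    """AWGN followed by JPEG re-compression, matching the experiment
    order (noise first, then compression). Because the added noise is
    high-frequency, much of it is removed by the lossy compression."""
    rng = rng or np.random.default_rng(0)
    noisy = np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)
    buf = io.BytesIO()
    Image.fromarray(noisy.astype(np.uint8)).save(buf, format='JPEG',
                                                 quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float64)
```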

4.3. Detecting stego images using convnet with a KV high-pass filter

The last experiments were aimed at detecting stego images with embedded secret messages. We added a KV high-pass filter [27] to the front of the convnet and changed the max pooling layers to average pooling layers, as described in [20,21].
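The KV high-pass filter mentioned above is the fixed 5 × 5 kernel from the rich models of [27]. A minimal sketch of applying it as a non-trainable pre-processing step, assuming `scipy` for the convolution:

```python
import numpy as np
from scipy.ndimage import convolve

# The 5x5 "KV" kernel from [27], used as a fixed (non-trainable)
# layer in front of the convnet. Its coefficients sum to zero, so it
# suppresses smooth image content and keeps the high-frequency
# residual that carries the stego signal.
KV = np.array([[-1,  2,  -2,  2, -1],
               [ 2, -6,   8, -6,  2],
               [-2,  8, -12,  8, -2],
               [ 2, -6,   8, -6,  2],
               [-1,  2,  -2,  2, -1]], dtype=np.float64) / 12.0

def kv_residual(img):
    """High-pass residual fed to the convnet instead of raw pixels."""
    return convolve(img.astype(np.float64), KV, mode='nearest')
```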

To ensure the same test conditions as other steganalysis experiments, we used another Bossbase dataset [26] with 10,000 gray images of 512 × 512 pixels. We divided these images into 256 × 256 pixel images, generating 40,000 normal images. Stego images were created using three steganography algorithms, and a total of 120,000 (three sets of 40,000) were created using S-UNIWARD [28], WOW [29], and Least Significant Bit (LSB) [30]. All three steganography algorithms adjust the pixel values in the spatial domain to insert secret messages at 0.4 bit per pixel. The learning rate was set to 1e-03.
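The 512 × 512 to 256 × 256 split described above can be sketched as follows; `tile_quarters` is a hypothetical helper name:

```python
import numpy as np

def tile_quarters(img, tile=256):
    """Split a square grayscale image into non-overlapping tiles,
    e.g. one 512x512 Bossbase image -> four 256x256 images, which is
    how 10,000 source images yield 40,000 normal images."""
    h, w = img.shape
    return [img[r:r + tile, c:c + tile]
            for r in range(0, h, tile)
            for c in range(0, w, tile)]
```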


Fig. 4. The graphs show the results for distinguishing normal and stego images generated with S-UNIWARD using the convnet with a KV high-pass filter.

Table 3. The maximum accuracy when using the standard mini-batch (STD) and the paired mini-batch (PAIR) to distinguish S-UNIWARD, WOW, and LSB stego images using the convnet with a KV high-pass filter.

∙ S-UNIWARD [28] with 0.4 bit per pixel
∙ WOW [29] with 0.4 bit per pixel
∙ LSB [30] with 0.4 bit per pixel

Fig. 4 shows the training loss and test accuracy when the convnet was trained to identify S-UNIWARD stego images. The losses were lower and the accuracy was higher when using the paired mini-batch, and the number of training iterations required to reach an accuracy of 0.7 with the paired mini-batch was 40.44% of the number required with the standard mini-batch.

Table 3 shows the maximum accuracy when using the standard and paired mini-batches for S-UNIWARD, WOW, and LSB stego images. When LSB stego images were detected, the final accuracy was 0.93 in both cases. However, when using the paired mini-batch, the accuracy increased much faster than when using the standard mini-batch. For WOW stego image detection, there was a slight difference between the accuracies of the two mini-batches, whereas the difference was clear for S-UNIWARD stego image detection. This performance difference is presumably due to the secret message insertion method.

5. Conclusion

Image forensics and steganalysis have different requirements compared to computer vision. Therefore, the convnets used in computer vision tend to learn features that represent the contents of images rather than forensic or steganalysis features. In this paper, we have shown that this limitation of training convnets for forensics and steganalysis can be overcome simply by constructing the mini-batch differently. We have explained why the proposed paired mini-batch is efficient for image forensics and steganalysis by presenting the training equation, and we have demonstrated the improvements in training speed and accuracy achieved with the paired mini-batch through various experiments. The average accuracy when using the paired mini-batch was always greater than when using the standard mini-batch at the same iteration, and the results for distinguishing normal and manipulated images appear significantly different. Although we experimented with three forensic manipulation techniques and three steganography algorithms, the proposed paired mini-batch is expected to be broadly applicable to image forensics and steganalysis.

Acknowledgment

This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean government (MSIP) (2017-0-01671, Development of high reliability image and video authentication service for smart media environment).

References

[1] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[3] D. Tomè, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, S. Tubaro, Deep convolutional neural networks for pedestrian detection, Signal Process., Image Commun. 47 (2016) 482–489.

[4] R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multi-context deep learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.

[5] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: European Conference on Computer Vision, Springer, 2014, pp. 184–199.

[6] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: Deep networks for video classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.


[7] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, G. Penn, Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, IEEE, 2012, pp. 4277–4280.

[8] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, 2014, arXiv preprint arXiv:1404.2188.

[9] Y. Kim, Convolutional neural networks for sentence classification, 2014, arXiv preprint arXiv:1408.5882.

[10] S. Zhou, W. Shen, D. Zeng, M. Fang, Y. Wei, Z. Zhang, Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes, Signal Process., Image Commun. 47 (2016) 358–368.

[11] J. Li, J. Feng, C.-C.J. Kuo, Deep convolutional neural network for latent fingerprint enhancement, Signal Process., Image Commun. 60 (2018) 52–63.

[12] A. Piva, An overview on image forensics, ISRN Signal Process. 2013 (2013).

[13] M.C. Stamm, M. Wu, K.R. Liu, Information forensics: an overview of the first decade, IEEE Access 1 (2013) 167–200.

[14] B. Li, J. He, J. Huang, Y.Q. Shi, A survey on image steganography and steganalysis, J. Inf. Hiding Multimedia Signal Process. 2 (2011) 142–172.

[15] J. Lukas, J. Fridrich, M. Goljan, Digital camera identification from sensor pattern noise, IEEE Trans. Inf. Forensics Secur. 1 (2006) 205–214.

[16] A.C. Popescu, H. Farid, Exposing digital forgeries in color filter array interpolated images, IEEE Trans. Signal Process. 53 (2005) 3948–3959.

[17] H. Gou, A. Swaminathan, M. Wu, Noise features for image tampering detection and steganalysis, in: Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 6, IEEE, 2007, pp. VI–97.

[18] T. Pevny, J. Fridrich, Merging Markov and DCT features for multi-class JPEG steganalysis, in: Electronic Imaging 2007, International Society for Optics and Photonics, 2007, pp. 650503–650503.

[19] J. Chen, X. Kang, Y. Liu, Z.J. Wang, Median filtering forensics based on convolutional neural networks, IEEE Signal Process. Lett. 22 (2015) 1849–1853.

[20] Y. Qian, J. Dong, W. Wang, T. Tan, Deep learning for steganalysis via convolutional neural networks, Media Watermark. Secur. Forensics 9409 (2015) 94090J.

[21] L. Pibre, P. Jérôme, D. Ienco, M. Chaumont, Deep learning for steganalysis is better than a rich model with an ensemble classifier, and is natively robust to the cover source-mismatch, 2015, arXiv preprint arXiv:1511.04855.

[22] B. Bayar, M.C. Stamm, A deep learning approach to universal image manipulation detection using a new convolutional layer, in: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, ACM, 2016, pp. 5–10.

[23] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.

[24] A. Ng, CS229 lecture notes, CS229 Lecture Notes 1 (2000) 1–3.

[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.

[26] P. Bas, T. Filler, T. Pevny, Break our steganographic system: The ins and outs of organizing BOSS, in: Information Hiding, Springer, 2011, pp. 59–70.

[27] J. Fridrich, J. Kodovsky, Rich models for steganalysis of digital images, IEEE Trans. Inf. Forensics Secur. 7 (2012) 868–882.

[28] V. Holub, J. Fridrich, T. Denemark, Universal distortion function for steganography in an arbitrary domain, EURASIP J. Inf. Secur. 2014 (2014) 1.

[29] V. Holub, J. Fridrich, Designing steganographic distortion using directional filters, in: Information Forensics and Security (WIFS), 2012 IEEE International Workshop on, IEEE, 2012, pp. 234–239.

[30] I.J. Cox, J. Kilian, T. Leighton, T. Shamoon, A secure, robust watermark for multimedia, in: International Workshop on Information Hiding, Springer, 1996, pp. 185–206.
